The nature of semistructured data in web collections is evolving. Even when XML web
documents are valid with regard to a schema, the actual structure of such documents
exhibits significant variations across collections for several reasons: an XML schema
may be very lax (e.g., to accommodate the flexibility needed to represent collections
of documents in RSS feeds), a schema may be large with different subsets used for
different documents (e.g., this is common in industry standards like UBL), or open
content models may allow arbitrary schemas to be mixed (e.g., RSS extensions like those
used for podcasting). A schema alone may not provide sufficient information for many
data management tasks that require knowledge of the actual structure of the collection.

Web applications (such as processing RSS feeds or web service messages) rely on
XPath-based data manipulation tools. Web developers need to use XPath queries
effectively on increasingly large web collections containing hundreds of thousands of XML
documents. Even when tasks only need to deal with a single document at a time,
developers benefit from understanding the behaviour of XPath expressions across multiple
documents (e.g., what will a query return when run over the thousands of hourly feeds
collected during the last few months?). Dealing with the (highly variable) structure of
such web collections poses additional challenges.
This thesis introduces DescribeX, a powerful framework that is capable of
describing arbitrarily complex XML summaries of web collections, providing support for more
efficient evaluation of XPath workloads. DescribeX permits the declarative description
of document structure using all axes and language constructs in XPath, and generalizes
many of the XML indexing and summarization approaches in the literature. DescribeX
supports the construction of heterogeneous summaries where different document elements
sharing a common structure can be declaratively defined and refined by means of path
regular expressions on axes, or axis path regular expressions (AxPREs). DescribeX can
significantly help in the understanding of both the structure of complex, heterogeneous
XML collections and the behaviour of XPath queries evaluated on them.

Experimental results demonstrate the scalability of DescribeX summary refinements 
and stabilizations (the key enablers for tailoring summaries) with multi-gigabyte web 
collections. A comparative study suggests that using a DescribeX summary created from 
a given workload can produce query evaluation times orders of magnitude better than 
using existing summaries. DescribeX's light-weight approach of combining summaries 
with a file-at-a-time XPath processor can be a very competitive alternative, in terms 
of performance, to conventional fully-fledged XML query engines that provide DB-like 
functionality such as security, transaction processing, and native storage. 
Chapter 1

Introduction

XML is widely used as a common format for web accessible data (e.g., hypertext
collections like Wikipedia) as well as for data exchanged among web applications (e.g., blogs,
news feeds, podcasts, web services messaging). This data is often referred to as
semistructured because it lacks a clear separation between data and metadata: tags
(metadata) and content (data) are mixed together in the same XML file.

The vast majority of software tools used for managing XML rely on XPath [W3C07]
as the core dialect for XML querying. Hence, web developers use XPath queries for
many of the tasks involved in the processing of XML collections. Such collections are
normally handled one document at a time, whether the document is an individual RSS
file (used by content distributors to deliver to subscribers frequently updated content
over the Web), a single SOAP message, or a Wikipedia article in XHTML.

Even when XML collections have a schema (which can be either a DTD or an XML
Schema), the actual structure present in each document may exhibit significant
variations for several reasons. First, schemas can be very lax. One reason for this is the



^ http://www.rss-specifications.com/
^ http://www.w3.org/TR/soap/
^ http://www.w3.org/TR/REC-xml
^ http://www.w3.org/TR/xmlschema-1/

extensive use of the <xsd:choice> construct in XML schemas, which allows optional
elements to occur any number of times, including zero. Such a construct is very common
in RSS, for instance. Second, a schema can be very large and only subsets are actually
used in a given instance. This is the situation with several industry-specific standards
that contain hundreds of elements, such as UBL or HR-XML. UBL and HR-XML are
standard libraries of XML schemas that support a variety of business processes. UBL is
designed to handle supply chain transactions such as purchase orders, shipping notices,
and invoices, whereas HR-XML contains schemas for human resource management such
as resumes, payroll information, and benefits enrollment. Finally, a schema can be
extended by using the <xsd:any> XML Schema construct, which allows arbitrary content
from other schemas to appear under a given element. Such a construct enables different
user communities to pick and choose how to combine schemas. Consequently, it provides
great flexibility, but makes it harder to determine the structure of the documents that
actually appear in a given collection. Examples of <xsd:any> extensions can be
found in a wide variety of industry standards, including RSS, UBL and HR-XML. For
instance, the UBL standard permits a contractor to represent invoice documents that
include HR-XML TimeCard elements for the contractor employee's time and expenses.
The actual structure of invoice collections will vary significantly across contractors and
customers. If an enclosing messaging schema is used, even the UBL and HR-XML
fragments in the document can be replaced by other invoicing and time billing schemas. In
these scenarios, schemas alone are insufficient for understanding the structure (metadata)
of the documents in the collection, whether for writing XPath queries or for optimizing their evaluation.

A developer working with this type of collection faces several challenges. She must 
learn enough about the structure present in the XML collection to be able to write 
meaningful XPath queries. She must also develop an understanding of how the XPath 



^ http://oasis-open.org/committees/ubl/
^ http://hr-xml.org

Figure 1.1: Wikipedia document graph (a) and its incoming path summary (b)



expressions behave across different documents in the collection. Even when a task deals 
with a single document at a time, the developer needs to extrapolate the behaviour 
of queries over a single document across the entire collection over which the task may 
be repeatedly applied. In this context, understanding the actual metadata of a web 
collection can be a significant barrier, even for collections validated against a schema. 

XML structural summaries are graphs representing relationships between sets of XML
elements (i.e., extents). Unlike schemas, which prescribe what may and may not occur in
an instance, summaries provide a description of the metadata that is actually present in a
given collection. Figure 1.1 (a) shows the instance graph of a Wikipedia sample document
in which nodes correspond to XML elements in the document. Nodes have an id and are
labeled with their element names. The structure in Figure 1.1 (b) is a typical summary
that groups together instance nodes with the same incoming label paths. In such a
summary, two nodes that have the same incoming label path from the root belong to the same
extent (sets located below each summary node in the figure). For instance, wikilink
elements appear at the end of three different label paths: article.section.wikilink (in blue),
article.section.section.wikilink (in red), and article.par.wikilink (in green). Consequently,
there are three different wikilink nodes in the summary, one with extent {3, 16}, another
with extent {6, 11}, and a last one with extent {18}. Extents can also be viewed as
mappings between instance (document) nodes and summary nodes, represented in the
figure by dashed arrows linking wikilink nodes in the document graph (left) and wikilink
extents in the incoming path summary (right). An edge (s_i, s_j) in the summary means
that at least one node in the extent of s_i is the parent of at least one node in the extent
of s_j. For instance, the edge from the figure node under par to its caption node means
that some figure elements within par have caption elements, but not necessarily all of
them (in this document, only node 12 has a caption element).
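
The grouping by incoming label path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample document, the function name `incoming_path_summary`, and the id numbering (document order, starting at 1) are hypothetical and do not reproduce Figure 1.1.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document (not the one shown in Figure 1.1).
DOC = """<article>
  <section><wikilink/><section><wikilink/></section></section>
  <par><wikilink/><figure><caption/></figure><figure/></par>
</article>"""


def incoming_path_summary(root):
    """Return (extents, edges) of the incoming path summary of a tree.

    extents maps each label path from the root to the ids of the elements
    reached by it; ids are assigned in document order, starting at 1.
    edges holds (parent path, child path) pairs between summary nodes.
    """
    extents = defaultdict(list)
    edges = set()
    next_id = 0

    def visit(elem, parent_path):
        nonlocal next_id
        next_id += 1
        path = parent_path + (elem.tag,)
        extents[path].append(next_id)
        if parent_path:
            # A summary edge exists as soon as one parent/child pair exists.
            edges.add((parent_path, path))
        for child in elem:
            visit(child, path)

    visit(root, ())
    return dict(extents), edges


extents, edges = incoming_path_summary(ET.fromstring(DOC))
wikilink_paths = sorted(p for p in extents if p[-1] == "wikilink")
for path in wikilink_paths:
    print(".".join(path), extents[path])
```

In this sample the two figure elements share one summary node even though only one of them has a caption, which is the "some, but not necessarily all" reading of a summary edge.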

Describing metadata in semistructured collections was a major motivation in one of
the earliest summary proposals in the literature [NUWC97, GW97]. Since then, research
on summaries has focused on query processing, making summaries one of the most studied
techniques for query evaluation and indexing in XML (and other semistructured) data
models [MS99, KBNK02, KSBG02, QLU03, BCF+05], as well as for providing statistics
useful in XML query estimation and optimization [PG06b].

Most of the existing summary proposals define all extents using the same criteria,
hence creating homogeneous summaries. These summaries are based on common
element paths (in some cases limited to length k), including incoming paths (e.g.,
representative objects [NUWC97], dataguides [GW97], 1-index [MS99], ToXin [RM01],
A(k)-index [KSBG02]), both incoming and outgoing paths (e.g., F&B-Index [KBNK02]), or
sequences of outgoing paths (e.g., Skeleton [BCF+05]). The few examples of
heterogeneous summaries that can adapt their structure based on a dynamic query
workload (e.g., APEX [CMS02], D(k)-index [QLU03], XSKETCH [PG06b]) compute the
extents from statistics and workload information.

However, none of these proposals can help us to find elements based on order and
cardinality criteria. Consider again the instance in Figure 1.1. What are the par elements
that contain two figures? How many section elements contain a figure with caption next
to a table? How many of those contain more than one figure? These are questions that
cannot be answered with any of the summaries mentioned above.

Moreover, since these proposals are algorithmically defined, it is hard to determine 
how they can be used together for processing today's increasingly heterogeneous and 
large web collections effectively. Specifically, the summary information is not defined 
declaratively, limiting the ease with which these summaries can be used within standard 
data management tasks. 

In this thesis, we propose a novel approach for flexibly summarizing the structure of
metadata actually present in an XML collection. We introduce DescribeX, a framework
that supports constructing heterogeneous summaries, where each set in the partition can
be defined by means of path regular expressions on axes, or axis path regular expressions
(AxPREs, for short). AxPREs provide the flexibility necessary for declaratively defining
complex mappings between instance nodes and summary nodes capable of expressing
order and cardinality, among other properties. Each AxPRE can be specified by the user
or obtained from any expression in the complete XPath language (all the axes, document
order, use of parentheses, etc.). Given an arbitrary XPath expression posed by the user,
DescribeX can create a partition defined by an AxPRE that captures exactly the
structural commonality expressed by a query. AxPRE summaries have a unique capability
that makes them suitable for describing the structure of XML collections: they are the
first summaries capable of declaratively defining and refining the summary extents using
a powerful language. In addition, DescribeX summaries express relationships between
instance nodes that go beyond the traditional parent-child (e.g., next sibling, following,
preceding, etc.). Last but not least, DescribeX captures most summary proposals in the
literature by providing a declarative definition for them for the first time.

This thesis argues that DescribeX can significantly help not only in the
understanding of the structure of large collections of XML documents, but also in the evaluation
of XPath queries posed on them. In fact, DescribeX summaries can also be used to
significantly speed up (and scale up) XPath evaluation over existing file-at-a-time tools,
enabling fast exploration of the results of XPath workloads on large collections. The
experimental results demonstrate that using a summary created from a given workload can
produce query evaluation times that are two orders of magnitude better than using
existing summaries (in particular, summaries on incoming paths like 1-index [MS99], APEX
[CMS02], A(k)-index [KSBG02], and D(k)-index [QLU03]). The experiments also
validate that DescribeX summaries allow file-at-a-time XPath processors to be a competitive
light-weight alternative (in terms of performance) to conventional DB-like XML query
engines supporting additional functionality such as security, transaction processing, and
native storage.



DescribeX can also help a user write and understand XPath queries
on large XML collections. Several software tools have been developed to help XPath users
debug query expressions (e.g., Oxygen XML Editor, Altova XMLSpy, etc.). A recent
research project includes a tool, XPlainer-Eclipse [CLR07], that provides visual
explanations of XPath expressions. An explanation returns precisely the nodes in a document
that contribute to the answer, a useful debugging technique. However, the main
limitation of traditional XPath debugging tools in the context of large XML collections is
that they provide debugging mechanisms only for a single document. Understanding
queries over collections containing thousands of documents (or even 650,000 documents,
as in the Wikipedia XML Corpus [DG06]) using these tools can be an impractical and
very time-consuming task. DescribeX provides an important foundation on which such
a large-scale XML collection understanding tool could be built, as evidenced by the
DescribeX-Eclipse tool presented in Appendix B.



^ http://www.oxygenxml.com/ 
^ http://www.altova.com/ 
1.1 Major contributions

This thesis identifies the growing need for describing the structure of web collections 
(encoded in XML) using mechanisms that go beyond providing one or more schemas. 
We propose the use of highly customizable summaries that represent the actual structure 
of metadata labels as used in a given collection. The following are the major contributions 
of this thesis. 

1.1.1 AxPRE summaries 

AxPRE summaries rely on the novel concept of a summary descriptor (SD). Traditional 
summaries consist of a labeled graph that describes the label paths in the instance (which 
we call an SD graph) together with an extent relation between summary nodes and sets 
of instance nodes. An SD incorporates three key original features: 

A description of the neighbourhood of a node expressed by path regular
expressions on axes (i.e., binary relations between nodes), AxPREs for short
(Chapter 3). AxPREs are evaluated on an axis graph, which is an abstract
representation of the XPath data model [W3C07] extended with edges that represent XPath axis
binary relations. Edges are labeled by axis names and nodes are labeled by element or
attribute names (including namespaces), or by new labels defined using XPath.

Given an axis graph A, an AxPRE α applied to a node v in A returns an AxPRE
neighbourhood of v, which provides a description of the subgraph local to v that satisfies
α. The AxPRE neighbourhood of v by α is computed by intersecting the automaton
constructed from the axis graph and the automaton accepting the language generated by
the AxPRE and all its prefixes.

The AxPRE neighbourhood of a node v is used to determine to which equivalence
class v belongs. That is, if two nodes in A have similar AxPRE neighbourhoods (i.e.,
they cannot be distinguished by α), they belong to the same equivalence class. This
way, an AxPRE can be used to define a partition of the nodes in A in which each set is the
extent of a node s in the SD. The notion of similarity we use is the familiar notion of
bisimulation [PT87].
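
To make the idea of partitioning by bisimilar neighbourhoods concrete, the following minimal sketch covers one special case only: neighbourhoods that follow the child axis all the way down a tree. It does not implement the automaton intersection described above. The sample document and the function name `child_bisimulation_classes` are assumptions made for illustration. On a finite tree, two elements are bisimilar along the child axis exactly when they have the same label and the same set of child classes, which the recursive signature below captures.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical feed fragment used only for illustration.
DOC = """<rss><channel>
  <item><title/><enclosure/></item>
  <item><title/><enclosure/><pubDate/></item>
  <item><enclosure/><title/><title/></item>
</channel></rss>"""


def child_bisimulation_classes(root):
    """Partition the elements of a tree by bisimilarity along the child axis.

    The signature of an element is its label plus the *set* of its children's
    signatures, so child order and multiplicity are deliberately ignored.
    """
    classes = defaultdict(list)

    def signature(elem):
        sig = (elem.tag, frozenset(signature(child) for child in elem))
        classes[sig].append(elem)
        return sig

    signature(root)
    return list(classes.values())


classes = child_bisimulation_classes(ET.fromstring(DOC))
item_class_sizes = sorted(len(c) for c in classes if c[0].tag == "item")
print(len(classes), item_class_sizes)
```

The first and third item elements fall into the same class because bisimulation compares sets of outgoing paths; distinguishing them would require order or cardinality, which is where richer axes come in.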

The use of AxPRE neighbourhoods supports the definition of summaries that go
beyond the traditional parent and child hierarchical relationships covered by the abundant
literature on summaries [GW97, MS99, KBNK02, KSBG02, PG02, QLU03, BCF+05,
PG06b]. In particular, AxPREs can describe heterogeneous SDs, i.e., SDs described by
multiple AxPREs.

An extent expression (EE) capable of computing precisely the set of elements
in the extent of a given SD node (Chapter 5.1). Since an AxPRE α is used to
compute by bisimulation an entire partition, we can say that all sets in the partition
share the same AxPRE α. Thus, AxPREs cannot be used to uniquely identify each
equivalence class (extent) in such a partition (unless the partition contains only one set).
For a large class of neighbourhoods, it is possible to precisely characterize the extent
of an SD node s with a new type of expression we call an extent expression (EE, for short).
The EE e_s of s with AxPRE α is generated from the bisimilarity contraction of the α
neighbourhoods of the elements in the extent of s. (Recall that all nodes in the extent
of s have bisimilar AxPRE neighbourhoods.) Thus, we pick any element in the extent
of s, compute its α neighbourhood, and then compute its bisimilarity contraction. The
representative neighbourhood thus obtained is guaranteed to be bisimilar to all
neighbourhoods in the extent of s. A representative neighbourhood provides the sequence of axis
compositions and labels that will appear in the EE that computes the extent of s. EEs
can be expressed in XPath and function like virtual views (see Chapter 6).



The notion of AxPRE refinements of SD nodes (Chapter 5.2). Exploring 
collections of XML documents typically requires knowledge of the metadata present in 
the collection. SDs provide a descriptive tool for representing metadata as SD graphs. 
The description provided by a node in the SD can be changed by an operation that
modifies its AxPRE and thus its AxPRE neighbourhood. This operation is called an 
AxPRE refinement of an SD node. Refinement refers to applying summarization to 
selectively produce more or less detailed SDs. 

The notion of refinement is well known in the XML literature [PT87]. Intuitively,
two nodes in the same equivalence class may be refined into different classes, and two 
nodes from different classes will always be refined into separate classes. An SD node can 
be refined by changing its AxPRE definition. This produces SDs that are tailored to 
the exploration needs of the user. Using successive node refinements, SD nodes can be 
refined to produce SDs that provide a more detailed description of the data. 

Previous proposals perform global refinements on the entire SD graph [KBNK02,
KSBG02] or local refinements based on statistics or workload [QLU03, HY04, PG06b],
without the ability to define the refinement declaratively. In contrast, we can precisely
characterize the neighbourhood considered for the refinement with an AxPRE [CRV08].
The notion of refinement is tightly related to that of stabilization. An edge
stabilization determines the partition of an extent into two sets based on the participation of the
extent nodes in the axis relation the edge represents.
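
As a rough illustration of an edge stabilization, the sketch below splits the extent at the source of an edge into the elements that participate in the axis relation and those that do not. It assumes the split applies to the source extent. The function name `stabilize_edge`, the dictionary encoding of the axis relation, and all element ids are hypothetical.

```python
def stabilize_edge(source_extent, target_extent, relation):
    """Split source_extent by participation in an axis relation.

    relation maps an element id to the ids it reaches along the axis.
    Returns (participating, non_participating), preserving input order.
    """
    targets = set(target_extent)
    participating = [v for v in source_extent
                     if any(w in targets for w in relation.get(v, ()))]
    hit = set(participating)
    non_participating = [v for v in source_extent if v not in hit]
    return participating, non_participating


# Hypothetical ids: three item elements, two of which have a pubDate child.
items = [6, 18, 24]
pub_dates = [22, 28]
child = {6: [7, 8], 18: [19, 22], 24: [25, 28]}

with_date, without_date = stabilize_edge(items, pub_dates, child)
print(with_date, without_date)
```

After the split, every element in the first set has a child in the target extent and no element in the second set does, so the edge no longer describes a "some but not all" relationship.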

1.1.2 Refinement lattice 

We show the existence of a hierarchical relationship between summaries and provide
a concise description of the hierarchy within the DescribeX framework based on a
refinement lattice. A refinement lattice describes a refinement relationship between entire
summaries.

The DescribeX refinement lattice provides a mechanism for capturing earlier summary
proposals, and understanding how those proposals relate to each other and to richer SDs
that were never previously considered in the literature (see Chapter 4). Each node in
the lattice corresponds to a homogeneous SD defined by an AxPRE. The top (coarsest)
summary of the lattice corresponds to the label SD where each node is partitioned by
label, and the bottom (finest) summaries of the lattice each correspond to a distinct
combination of axes.

1.1.3 System implementation 

In Chapter 7 we present the implementation of the DescribeX summarization engine
for interactively creating and refining AxPRE summaries given large collections of XML
documents. Chapter 8 provides experimental results that validate the performance of
the techniques employed by DescribeX.

The engine uses Berkeley DB Java Edition to store and manage indexed collections,
and supports an arbitrary XPath processor for the evaluation of XPath expressions. A
visual interactive tool based on the DescribeX framework, DescribeX-Eclipse (see Appendix
B), was developed as an Eclipse plug-in. In addition to the DescribeX summarization
engine presented in this thesis, DescribeX-Eclipse provides retrieval and visualization
tools implemented by other colleagues [ACKR08].

Our experiments (employing gigabyte XML collections) provide strong evidence of the 
advantages of using DescribeX to build and exploit summaries for exploration and XPath 
query evaluation. These results demonstrate that the simple mechanism of accessing a 
summary extent employed by the DescribeX implementation yields speedup factors of 
over two orders of magnitude over commercial and open source implementations. 

1.1.4 Answering queries using extents 

For evaluating a query using an SD, we need to find the SD nodes that participate in the 
answer. Since our framework relies on XPath EEs for defining the extents, the problem 
of answering queries using extents is related to that of XPath containment [Sch04].



^ http://www.oracle.com/technology/products/berkeley-db/je/index.html
^ http://www.eclipse.org/

DescribeX can derive AxPREs from queries and use them to change the
description provided by the SD. Since AxPREs describe only structural constraints and XPath
queries may contain predicates on values, extents resulting from AxPRE manipulation
rarely provide the exact answer without further filtering. The main reason for this is that
the addition of an XPath value predicate either reduces the size of the answer or leaves
the answer unchanged. Thus, DescribeX first finds the SD nodes that participate in the
answer (i.e., those whose extents contain at least part of the answer), then evaluates the
entire expression on them and takes the union of the results to get the exact answer (see
Chapter 6.4). The experimental results provided in Chapter 8.4 considerably expand the
preliminary results presented in [CR07].
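
The two-step procedure just described (select the participating SD nodes, then filter their extents with the complete query and take the union) can be sketched as follows. This is not the DescribeX implementation: the function `answer_using_extents`, the SD node names, the element ids, and the value predicate are all hypothetical, and the structural test is passed in as a plain function rather than derived through XPath containment.

```python
def answer_using_extents(extents, participates, predicate):
    """Evaluate a query over the extents of the participating SD nodes.

    extents: SD node name -> list of element ids.
    participates: SD node name -> bool, a structural test telling whether
        the node's extent may contain part of the answer.
    predicate: element id -> bool, the complete query, values included.
    """
    answer = set()
    for node, extent in extents.items():
        if participates(node):  # extents of other nodes are never touched
            answer |= {e for e in extent if predicate(e)}
    return answer


# Hypothetical SD with two item nodes and a value predicate on pubDate.
extents = {"single_item": [6], "multi_item": [18, 24]}
pub_date = {6: "Mon", 18: "Mon", 24: "Tue"}

answer = answer_using_extents(
    extents,
    participates=lambda node: node == "multi_item",
    predicate=lambda e: pub_date[e] != "Mon",
)
print(sorted(answer))
```

The value predicate can only shrink what the structural step selected, which is why the extents act as supersets of the answer rather than the answer itself.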

1.2 Motivating example: exploring RSS feeds with 
summaries and XPath queries 

This section walks through a concrete example to illustrate how DescribeX summaries 
can help developers perform collection-wide exploration and XPath query evaluation. 

Consider a developer, Sue, who has to implement a web application that retrieves
RSS feeds from several content providers to produce an aggregated meta-feed. The feed
may span several days or weeks, and there might be more than one item in the feed per
day. Figure 1.2 shows the instances of two sample RSS feeds represented as axis graphs.
An axis graph can display selected binary relations between elements in an XML
document tree, like doc, c, fs, and fc shown in the figure (shorthands for XPath axes
document, child, following-sibling, and for the derived axis firstchild, respectively). The
semantics of these axes is straightforward: the edge from element 6 to 7 labeled fc means
that 7 is the first child of 6 in document order, and the edge from element 18 to 24
labeled fs means that 24 is a following sibling of 18 in document order. For simplicity,
even though every first child is also a child, we do not draw the c edge between two
Figure 1.2: Axis graphs of RSS feed samples

nodes when an fc edge exists between them. Being binary relations, axes have inverses,
e.g., the inverse of c is p (shorthand for parent) and the inverse of fs is ps (shorthand for
preceding-sibling). These inverses are not shown in the figure.
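
A minimal sketch of how the c, fc, and fs edges of such an axis graph could be derived from a parsed document is shown below. The sample feed, the function name `axis_graph`, and the id numbering (document order, starting at 1) are assumptions and do not reproduce Figure 1.2; the doc axis and the inverse axes are left out. As in the figure, the first child receives only an fc edge.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed skeleton used only for illustration.
DOC = "<rss><channel><title/><item/><item/></channel></rss>"


def axis_graph(root):
    """Return (labels, edges); edges are (source id, axis, target id)."""
    labels, edges = {}, []
    next_id = 0

    def visit(elem):
        nonlocal next_id
        next_id += 1
        my_id = next_id
        labels[my_id] = elem.tag
        child_ids = [visit(child) for child in elem]
        for pos, cid in enumerate(child_ids):
            # The first child gets an fc edge instead of a c edge.
            edges.append((my_id, "fc" if pos == 0 else "c", cid))
            # Every later sibling is a following sibling of this child.
            for later in child_ids[pos + 1:]:
                edges.append((cid, "fs", later))
        return my_id

    visit(root)
    return labels, edges


labels, edges = axis_graph(ET.fromstring(DOC))
for edge in sorted(edges):
    print(edge)
```

Inverse edges such as p or ps could be obtained by swapping the endpoints of the corresponding tuples.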

Using DescribeX, Sue can create a summary descriptor (SD for short) like the one
shown in Figure 1.3 (a). This label SD, created from the two feeds in Figure 1.2, partitions
the elements in the feeds by element name. For example, SD node s6 represents all the
item elements in the two documents, {6, 18, 24} (this set is called the extent of s6).

An SD edge is labeled by the axis relation it represents. For instance, edge (s6, s5) is
labeled by c, which means that there is a c axis relation between elements in the extents
of s6 and s5. Figure 1.3 (a) shows three kinds of edges, depending on properties of the
sets that participate in the axis relation: dashed, regular, and bold. Dashed edges, like
(s6, s5) labeled c, mean that some element in the extent of s6 has a child in the extent
of s5. Regular edges, like (s6, s3) labeled fc, mean that every element in the extent of s6
has a first child in the extent of s3. Finally, a bold edge (s_i, s_j) labeled c means that
every element in the extent of s_j is a child of some element in the extent of s_i and that
every element in the extent of s_i has some child in the extent of s_j.
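
The three kinds of edges can be told apart by checking how completely the two extents participate in the axis relation. The sketch below is one possible reading of the definitions above; the function name `edge_kind`, the encoding of an axis as a set of pairs, and all element ids are hypothetical.

```python
def edge_kind(source, target, pairs):
    """Classify the SD edge between two extents for one axis.

    pairs is a set of (u, v) tuples meaning that v is reached from u along
    the axis. Returns None, "dashed", "regular", or "bold".
    """
    source, target = set(source), set(target)
    linked = [(u, v) for (u, v) in pairs if u in source and v in target]
    if not linked:
        return None                      # no edge at all
    if {u for u, _ in linked} != source:
        return "dashed"                  # only some source elements participate
    if {v for _, v in linked} != target:
        return "regular"                 # every source element participates
    return "bold"                        # both extents participate completely


# Hypothetical ids for a two-feed collection.
channels, items = {2, 14}, {6, 18, 24}
titles, pub_dates = {3, 7, 15, 19, 25}, {22, 28}
child = {(2, 6), (14, 18), (14, 24), (18, 22), (24, 28)}
first_child = {(6, 7), (18, 19), (24, 25)}

print(edge_kind(items, pub_dates, child))      # some items have a pubDate
print(edge_kind(items, titles, first_child))   # every item starts with a title
print(edge_kind(channels, items, child))       # total on both sides
```

The second case is regular rather than bold because the title extent of a label SD also holds titles that belong to channel elements.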

From the label SD Sue learns that channel elements in the collection always contain 
Figure 1.3: Label SD (a), and heterogeneous SD (b) of the RSS feed samples

title, link, description, and item subelements. However, the structure of item elements 
may vary. An item in the two sample feeds always includes title and enclosure elements, 
but may contain any combination of description and pubDate elements. Note that the 
label SD does not provide information on exactly which combinations actually appear. 
At this point Sue has two options: 

1. She can interactively refine the SD node s2 in the label SD in order to learn how
many different types of channels exist in the collection (i.e., how many subsets of
title, enclosure, description and pubDate are present within item elements).

2. Since she already knows that some item elements have a pubDate from the label
SD and she is interested in channels that contain such items, she can write query
Q1 to retrieve them.

Q1 = /rss/channel[item[pubDate]]
Sue can now decide either to run Q1 using the current SD or to make DescribeX
adapt the current SD to Q1. If she picks the former option, DescribeX finds the only SD
node that contains a superset of the answer (s2) and runs Q1 on its entire extent. If Sue
chooses the latter option, DescribeX changes the SD by partitioning the single channel
node s2 in Figure 1.3 (a), which represents all channels in the collection, into two channel
nodes: one with a pubDate within their item elements and another without a pubDate
(s22 and s21 in Figure 1.3 (b), respectively). Both SDs can be used to evaluate query
Q1, but notice that this latter refinement (the SD of Figure 1.3 (b)) will yield a more efficient
evaluation.

Summaries in DescribeX are defined and manipulated via AxPREs. AxPREs describe
the neighbourhood of the elements in a given extent. A neighbourhood of an element v
for an AxPRE α is the subgraph local to v that matches α. For instance, the p* AxPRE
describes the neighbourhood of v containing all label paths from v to the root, c* all
label paths from v to the leaves, and fc.ns* the sequence of v's child labels. AxPREs
can also be derived from a query in order to adapt an SD to it. For example, the
[channel].c.c AxPRE of node s21 in Figure 1.3 (b) was derived from Q1 and describes the
neighbourhood of channel elements with common outgoing label paths of length 2 (more
on this in Chapter 3). Sue could have written the [channel].c.c AxPRE herself had she wanted
to refine the channel node s2 according to the substructure of the channel elements in
the extent of s2 (since she knows from the label SD that the variability within channel
elements may only come from description and pubDate within item subelements, the c.c
AxPRE representing outgoing label paths of length 2 suffices).

Suppose further that Sue is also interested in item elements containing both title and 
enclosure subelements, but she does not know whether such items exist in the collection 
and, if they do, how common they are. In addition, she wants those items to be part of 
a series (i.e., to belong to channel elements that contain more than one item element, as 
done in feeds for podcasts published daily). Therefore, she writes another query: 
Q2 = /rss/channel[item/following-sibling::item]
[not(pubDate=../item[1]/pubDate)]/item[title][enclosure]

Q2 contains structural (in black) and non-structural (in grey) XPath constructs. The
expression that results from removing all non-structural constraints is called the structural
subquery of Q2. A structural subquery provides insight into the behaviour of the entire
query and can be used by DescribeX to refine an SD.

As with Q1, Sue can decide to either evaluate Q2 on the current SD (the label
SD with the refined channel node) or to add Q2 to the workload and make DescribeX
adapt the current SD. Assuming she chooses the second option, the system partitions
the item node s6 from Figure 1.3 (a) into the nodes s61 and s62 in Figure 1.3 (b) that
describe the structure of the collection with respect to the workload including Q2 and
Q1. Note that the extent of node s62 is exactly the answer to the structural subquery
of Q2, and thus a superset of the answer of Q2. The elements in this extent are called
candidate elements. Hence, by adapting the SD to the structural subquery, DescribeX
has considerably reduced the search space for computing the entire query.

In a document-at-a-time approach to query evaluation, adapting an SD to a
workload can reduce the number of documents on which queries in the workload need to
be evaluated, potentially yielding a significant speedup (see Chapter 8). That is, after
adapting the SD to a given query Q, DescribeX can evaluate Q only on those documents
(called candidate documents) that are guaranteed to provide a non-empty answer for the
structural subquery of Q. Those candidate documents that do contain an answer for the
entire query are called answer documents.
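
The distinction between candidate documents and answer documents can be illustrated with Python's standard `xml.etree.ElementTree` module and its limited XPath subset. Everything in this sketch is hypothetical: the three feeds, the file names, and the query with a value predicate on pubDate. The candidate set is computed here by brute force, by running the structural subquery on every document, whereas an adapted SD would make it available without scanning the collection.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection of three small feeds.
FEEDS = {
    "a.xml": "<rss><channel><item><title>t</title><pubDate>Mon</pubDate></item></channel></rss>",
    "b.xml": "<rss><channel><item><title>t</title><pubDate>Tue</pubDate></item></channel></rss>",
    "c.xml": "<rss><channel><item><title>t</title></item></channel></rss>",
}

# Hypothetical query /rss/channel[item[pubDate='Mon']] and its structural
# subquery /rss/channel[item[pubDate]], rewritten for ElementTree: a document
# has a matching channel exactly when one of the item paths below matches.
FULL = "./channel/item[pubDate='Mon']"
STRUCTURAL = "./channel/item[pubDate]"


def matches(xml_text, path):
    """Tell whether path selects at least one element under an rss root."""
    root = ET.fromstring(xml_text)
    return root.tag == "rss" and bool(root.findall(path))


candidates = [name for name, text in FEEDS.items() if matches(text, STRUCTURAL)]
answers = [name for name in candidates if matches(FEEDS[name], FULL)]
print(candidates, answers)
```

Only the candidate documents are parsed for the full query, so the saving grows with the share of documents that fail the structural test.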

It is important to note that DescribeX can recognize two kinds of channels with
different structure beyond the elements directly contained by them, a capability not
available using DTDs (unless channel elements are renamed, which is not a possibility
when the original DTD or the instances cannot be modified). In particular, proposals to
infer a DTD from an instance (such as [BNST06, GGR+03]) by suggesting (general, but
succinct) regular expressions from the strings of child elements do not help to identify
the two kinds of channels as done above. For instance, the DTD expression <!ELEMENT
channel (title, link, description, item)> can be inferred for the channel elements
occurring in the instances shown in Figure 1.2. However, a DTD can only give a rule for
the children of channel; there is no mechanism for giving rules relating channel elements
to their grandchildren (or any other elements farther away). In contrast, the AxPRE
summary in Figure 1.3 (b) can distinguish between a channel containing an item with a
pubDate element and those that contain a description, and also between item elements
that belong to a multi-item channel and single-item ones.

As we will show in this thesis, DescribeX is not only more expressive than DTDs 
and XML Schemas, but also more expressive than other summary proposals, making it a 
robust foundation for managing large document collections. 



1.3 Organization 

This thesis is structured as follows. Chapter 2 gives an overview of the large body of 
related work in the literature. Chapter 3 introduces the DescribeX framework, including 
the AxPRE language and some basic notions such as neighbourhood, bisimilarity, 
and summary descriptor (SD). Chapter 4 revisits some of the related work discussed in 
Chapter 2 and explains how it can be captured by the DescribeX framework and how 
DescribeX offers significant new functionality. Chapter 5 presents two new operations, 
AxPRE refinement and stabilization, for declaratively changing the description provided 
by an SD using AxPREs. Refinement and stabilization are central to the use of summaries 
for both structure understanding and query processing. Chapter 6 introduces a 
novel mechanism to characterize an SD node with an XPath expression whose evaluation 
returns exactly the elements in the extent. It also discusses how to compute AxPRE 
refinements and stabilizations with XPath expressions and how to evaluate XPath queries 
using DescribeX summaries. Chapter 7 describes the implementation of the DescribeX 
summarization engine for creating and manipulating SDs of XML collections. Chapter 8 
provides experimental results, using gigabyte-size XML collections, that validate the 
performance of the techniques employed by our framework. We conclude in Chapter 9 by 
presenting some future research issues. In addition, Appendix A provides a concise 
definition of the formal semantics of XPath 1.0, and Appendix B presents a visual interactive 
tool built on top of the DescribeX summarization engine. 



Chapter 2 



Related work 



In this chapter, we discuss contributions from the literature on structural summaries 
and other areas related to our work, such as path summaries for object-oriented data, 
hierarchical encodings, answering XML queries using views, and validating summaries. 

2.1 Structural summaries 

The large number of summaries that have been proposed in recent years clearly establishes 
the value and usefulness of these structures for describing semistructured data, assisting 
with query evaluation, helping to index XML data, and providing statistics useful in 
XML query optimization. 

Most of the summary proposals in the literature define synopses of predefined 
subsets of paths in the data. They construct a labeled graph that represents 
relationships between sets of XML elements. Examples of such summaries are region 
inclusion graphs (RIGs) [CM94], representative objects (ROs) [NUWC97], dataguides [GW97], 
reversed dataguides [LS00], 1-index, 2-index and T-index [MS99], and more recently, 
ToXin [RM01], A(k)-index [KSBG02], F-Index, B-index, and F&B-Index [KBNK02]. 
Dataguides and ROs group nodes into sets according to the label paths incoming to 
them (each node may appear more than once in the dataguide if the document 
instance is not just a tree). RIGs, 1-index, T-index, ToXin, F&B-Index, and F+B-Index, 
on the other hand, partition the data nodes into equivalence classes (called extents in 
the literature) so that each node appears only once in the summary. The partition is 
computed in different ways: according to the node labels (RIGs), the label paths 
incoming to the nodes (1-index, ToXin, A(k)-index), the label paths going out from the 
nodes (reversed dataguides), or label paths both incoming and outgoing (F&B-Index and 
F+B-Index). The length of the paths in the summary also varies: ToXin, 1-index and 
F&B-Index/F+B-Index summarize paths of any length, whereas A(k)-index is a 
synopsis of paths of a fixed length. Updates to structural summaries have been studied in 
[KBNS02] and [YHSY04]. 

RIGs were one of the first summaries proposed in the literature, introduced in the 
context of region algebras [CM94, YLT03]. Dataguides [GW97] group nodes in a rooted 
data graph into sets called target sets according to the label paths from the root they 
belong to. Since the label paths form a language, its deterministic finite automaton 
(DFA) is used as a more concise representation of the label paths. The construction of 
a dataguide from a data graph is equivalent to the conversion of an NFA (the XML tree) 
into a deterministic finite automaton (the dataguide) [NUWC97]. 

An index family was presented in [MS99] (1-index, 2-index, and T-index). Like 
dataguides, the 1-index summarizes root-to-leaf paths. In the 1-index, the nodes of an 
XML tree are partitioned into equivalence classes according to the label paths they 
belong to. Since the 1-index extents constitute a partition of the XML nodes, the number 
of 1-index nodes can never exceed the number of nodes in the XML tree. The extreme case 
is the one in which every XML node belongs to a separate equivalence class (which is in 
fact the data instance). The 1-index partition is computed by using bisimulation [PT87]. 

Also based on bisimulation, the A(k)-index was introduced in [KSBG02]. The 
construction of the summary is based on k-bisimilarity (bisimilarity computed for paths of 
length k). Thus, the A(0)-index creates the partition based on the labels of the nodes 
(0-bisimilarity), and the A(h)-index uses h-bisimilarity, which creates the partition based 
on incoming label paths of length h. 
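
On a tree, the A(k) partition just described can be sketched by grouping nodes on their label together with up to k ancestor labels. The tree below is hypothetical and the function is a minimal illustration, not the algorithm of [KSBG02], which works on general graphs.

```python
from collections import defaultdict

# Hypothetical tree: node id -> (label, parent id or None).
TREE = {
    1: ("rss", None),
    2: ("channel", 1),
    3: ("title", 2),
    4: ("item", 2),
    5: ("title", 4),
    6: ("item", 2),
    7: ("title", 6),
}


def ak_partition(tree, k):
    """Group nodes by their incoming label path of length at most k."""

    def key(node, depth):
        label, parent = tree[node]
        if depth == 0 or parent is None:
            return (label,)
        return key(parent, depth - 1) + (label,)

    blocks = defaultdict(set)
    for node in tree:
        blocks[key(node, k)].add(node)
    return dict(blocks)


a0 = ak_partition(TREE, 0)  # partition by node label only
a1 = ak_partition(TREE, 1)  # partition by parent label and node label
```

With k = 0 all title nodes share one extent; with k = 1 the channel title is separated from the item titles.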

Another index family was introduced in [KBNK02]. The F&B-Index construction uses 
bisimulation like the 1-index, but applied to the edges and their inverses in a recursive 
procedure until a fixpoint is reached. With this construction, the F&B-Index's equivalence 
classes are computed according to the incoming and outgoing label paths of the nodes. The 
same work introduces the F+B-index, which applies the recursive procedure only twice, 
once for the edges and once for the reversed edges. Both F&B-Index and F+B-index are 
special cases of the BPCI(k,j,m) index, where k and j control the lengths of the paths 
and m the iterations of the bisimulation on the edges and their inverses. 

ToXin consists of three index structures: the ToXin schema, the path index, and the 
value index. The ToXin schema is equivalent to a strong dataguide. The path index 
contains additional structures that keep track of the parent-child relationship between 
individual nodes in different extents. A recent proposal, TempIndex [MRV04], extends 
ToXin with the temporal dimension in order to speed up path queries on a temporal 
XML data model. TempIndex summarizes incoming paths that are valid continuously 
during a certain time interval and is part of the TSummary framework [RV08]. 

Based on the A(k)-index, a recent proposal [FGW+07] defines partitions of paths, 
rather than nodes, called P(k)-partitions, where k is the maximum length of the paths 
summarized. This work also introduces an algebraization of the navigational core of 
XPath in order to define XPath fragments that can be coupled to P(k)-partitions for 
fast evaluation of queries in the fragments. Since this proposal is based on navigational 
XPath, it supports only expressions containing compositions of parent, ancestor, child, 
and descendant axes. In contrast, DescribeX can be used to evaluate expressions in the 
complete XPath language (with all the axes, functions, use of parentheses, etc.). 

Other summaries are augmented with statistical information of the instance for 
selectivity estimation, including path/branching distribution (XSKETCH [PG02, DPGM04]), 
value distributions (XCLUSTER [PG06a]), and additional statistical information for 
approximate query processing (TREESKETCH [PGI04]). 



A few adaptive summaries like APEX [CMS02], D(k)-index [QLU03], and M(k)-index 
[HY04] use dynamic query workloads to determine the subset of incoming paths to be 
summarized. APEX is a summary of frequently used paths that summarizes incoming 
paths to the nodes and adapts to changes in the workload by changing the set of paths 
considered in the synopsis. That is, instead of keeping all paths starting from the root, 
it maintains paths that have some "support" (i.e., paths that appear a number of times 
over a certain threshold in the workload). The workload APEX considers consists of 
expressions containing a number of child axis compositions that may be preceded by a 
descendant axis, without any predicate. D(k)-index and M(k)-index, in contrast, 
summarize variable-length paths based on both the workload and local similarity (the 
length of each path depends on its location in the XML instance). 

There has been almost no work on summaries that capture the node ordering in the 
XML tree: the only proposals we are aware of are the early region order graphs (ROGs) 
[CM94] and the Skeleton summary [BCF+05] that clusters together nodes with the same 
subtree structure. Skeleton has additional structures that store relationships between 
individual nodes that belong to different equivalence classes. 

In contrast to these proposals, DescribeX is capable of declaratively defining complex 
mappings between instance nodes and summary nodes for expressing order, cardinality, 
and relationships that go beyond the traditional parent-child (e.g., next sibling, following, 
preceding, etc.). In addition, DescribeX provides, for the first time, a declarative 
definition for most of the proposals discussed above (for more details on how DescribeX 
captures other structural summaries see Chapter 4). 

2.2 Path summaries for OO data 

We can trace the origin of structural summaries for XML to the OODB community. This 
community has been quite active in the past in the area of path summaries for 
object-oriented data. Examples are path indexes [Ber94], access support relations [KM90], 
and join index hierarchies [XH94]. All three proposals materialize frequently traversed 
paths in the database. Access support relations are designed to support joins along 
arbitrary reference chains leading from one object instance to another. They also support 
collection-valued attributes by materializing frequently traversed reference chains of 
arbitrary length. Access support relations are a generalization of the binary join indices 
originally proposed for the relational model [Val87]. One fundamental difference with 
respect to join indices, however, is that rather than relating only two relations (or object 
types), access support relations materialize access paths of arbitrary length. 

A path index can materialize the same class of paths as an access support relation. It 
stores the sequence of nodes (objects) that define a given path. In contrast, a join index 
hierarchy constructs hierarchies of join indices to optimize navigation via a sequence of 
objects and classes. A join index stores the pairs of identifiers of objects of two classes 
that are connected via logical relationships. Since all these OODB approaches are based 
on the paths found in the OO schema, they can only be adapted to XML documents 
for which either a DTD or an XML Schema is present. In contrast, DescribeX permits 
summarization of collections without any schema. 

2.3 Hierarchical encodings 

We should mention that, in addition to the use of summaries, query evaluation can be 
facilitated by encoding the hierarchical structure of an XML instance. Node encoding 
evaluations use some sort of interval encoding [SK85] to label each node with its 
positional information within the XML instance. This positional information is used by join 
algorithms to efficiently reconstruct paths and label paths. Recent proposals for node 
encoding evaluations are region algebras [CM94, YLT03], path joins (XISS) [LM01], 
relative region coordinates [KYU01], structural joins [AKJP+01, CVZ+02], holistic twig 
joins [BKS02, JWLY03], partition-based path joins [LM03], XR-Tree [JLWO03], 
PBiTree [WJLY03, VMT04], extended Dewey encoding for holistic twig joins [LLCC05], and 
FIX [ZOIA06], a feature-based indexing technique. 

Structural encoding proposals are based on mapping the XML tree structure into 
strings and use efficient string algorithms for query processing. Since the size of each 
string grows with the length of the encoded path, many approaches use some sort of 
compression to offset this overhead. Examples of those are Index Fabric [CSF+01], tree 
signatures [ADR+04], materialized schema paths [BW03], PathGuides [CYWY03], and 
tree sequencing (ViST [WPFY03], PRIX [RM04], and NoK [ZKO04]). These encodings 
can be used in conjunction with structural summaries to improve query evaluation 
performance. In fact, the availability of summaries can be of great assistance to an XML 
optimizer [BCM05]. 

DescribeX uses an interval encoding derived from [SK85] in which each element in 
the collection is represented by its start and end positions (the character offsets from the 
beginning of the document it belongs to). 
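
An interval encoding of this kind can be sketched in a few lines. The tokenizer below is an assumption made for illustration (a regular expression that handles only plain start and end tags), not the encoder used by DescribeX. It shows how start and end character offsets support an ancestor test by interval containment.

```python
import re

DOC = "<channel><item><title>t</title></item></channel>"


def interval_encode(text):
    """Return (label, start, end) triples using character offsets.

    Assumes simple markup: no comments, CDATA sections, or self-closing tags.
    """
    stack, triples = [], []
    for match in re.finditer(r"<(/?)([\w:-]+)[^>]*>", text):
        closing, name = match.group(1), match.group(2)
        if closing:
            label, start = stack.pop()
            triples.append((label, start, match.end()))
        else:
            stack.append((name, match.start()))
    return sorted(triples, key=lambda t: t[1])


def contains(ancestor, descendant):
    """Ancestor test: the descendant interval lies strictly inside the ancestor's."""
    return ancestor[1] < descendant[1] and descendant[2] < ancestor[2]


channel, item, title = interval_encode(DOC)
```

Here `contains(channel, title)` holds while `contains(title, item)` does not, so the hierarchy can be recovered from the offsets alone.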

2.4 Answering XML queries using views 

Another area closely related to summarization is answering queries using views. As 
in traditional database systems, the performance of XML queries can be improved by 
rewriting them using caching and materialized views containing information relevant to 
the computation of the query. A recent contribution in this area includes a framework 
for XPath view materialization and query containment [BOB+04] that uses value and 
structure indexes on views. Another framework was proposed in [MS05] for maintaining 
a semantic cache of XPath query results as materialized views used to speed up query 
processing. Other work has considered the problem of deciding the existence of a query 
rewriting and finding a minimal rewriting using XPath views [XO05], and computing 
maximal contained rewriting for tree pattern queries (a core subset of XPath) [LWZ06]. 

For XQuery, query rewriting poses additional challenges. One of them is that queries 
may be nested. Another challenge comes from the mix of list, bag and set semantics 
supported by XQuery, which makes testing equivalence more difficult. In this context, 
there has been some work on query rewriting for nested XQuery queries using nested 
XQuery views [ODPC06]. A recent contribution for extended tree pattern views (a 
subset of XQuery) proposes containment and equivalent rewriting strategies based on a 
dataguide enhanced with integrity constraints [ABMP07]. This proposal considers only 
queries described by tree patterns. 

We must point out here that most of the work in this area could be applied to our 
framework to expand the query evaluation techniques we present in Chapter 6. 

2.5 Validating summaries 

DTD and XML Schema are proposals used for validation and verification of XML 
documents. A DTD is a context-free grammar and an XML Schema is a typed definition 
language. Both are schemas in the database sense, and thus describe classes of documents 
and constrain their structure. However, they provide only a limited description of the 
instances that satisfy them and no mechanism to locate specific instance fragments. In 
contrast, summaries are constructed for a particular instance and consequently provide a 
tighter description of the data. They also contain the necessary information for locating 
the instance fragments they describe. DTDs and XML Schemas can be used to constrain 
the construction of summaries but they are no substitute for them. Moreover, summaries 
can be constructed even when DTDs and XML Schemas are not present. 

^ http://www.w3.org/TR/REC-xml 

^ http://www.w3.org/TR/xmlschema-1/ 

Figure 2.1: Label SD fragment with XML graph schema model annotations 

In addition to describing an instance, DescribeX summaries could potentially be used 
for prescribing or constraining the data by adding schema constructs. Figure 2.1 shows 
a fragment of the label SD from Figure 1.3 (a) annotated with XML graph schema 
constructs [MS07] in blue. These constructs, which contain choice and sequence nodes 
(and others not shown in the figure), are able to express XML schema languages like 
DTDs, XML Schemas, and Relax NG. (For a survey on XML schema languages see 
[MLMK05].) The SD of Figure 2.1 represents channels that contain exactly one title, one 
description and one or more items, which themselves contain one title and a sequence of 
zero or more description elements. In the figure, choice and sequence nodes are used to 
represent the number of occurrences of an element, which can be zero, one, or unbounded. 



The DTD corresponding to the elements that appear in Figure 2.1 is the following: 
<!ELEMENT channel (title, item+, description)> 
<!ELEMENT item (title, description?)> 
<!ELEMENT title (#PCDATA)> 
<!ELEMENT pubDate (#PCDATA)> 

^ http://www.oasis-open.org/committees/relax-ng/spec-20011203.html/ 

In an SD, schema annotations have to be consistent with instance descriptions. For 
example, the existential edge ($s_6$, $s_5$) is compatible with a schema permitting any number 
of occurrences of $s_5$ (even zero). In contrast, the same edge is incompatible with a 
schema requiring at least one $s_5$ element because the dashed edge allows some items to 
have no descriptions. 

This is just an example of how schema constructs can be integrated with DescribeX 
summaries. There are many other ways of approaching the subject, but we do not 
consider it further in this thesis. 

This chapter provided a discussion of related work on structural summaries and four 
other areas close to our work: path summaries for OO data, hierarchical encodings, 
answering XML queries using views and validating summaries. In the next chapter, we 
begin introducing one of the major contributions of this thesis, the DescribeX framework. 
We will show how DescribeX generalizes and extends both structural and path summaries, 
and how DescribeX summaries can be used in query processing. 



Chapter 3 



AxPRE summaries 



This chapter provides an overview of the DescribeX framework. The framework includes 
a powerful language based on axis path regular expressions (AxPREs) for describing each 
set in a partition of instance nodes (extents). AxPREs provide the flexibility necessary for 
declaratively specifying the mapping between instance nodes and summary nodes for a 
given collection. These AxPRE mappings are capable of expressing order and cardinality, 
among other properties. AxPREs are evaluated on a graph (called axis graph) in which 
nodes are XML elements and edges are binary relations between them. Hence, AxPREs 
can be viewed as path regular expressions on binary relations. These relations include 
all XPath axes and additional ones that can be expressed in XPath. 

Extents are defined using a novel approach: selective bisimilarity applied to subgraphs 
described by AxPREs (i.e., AxPRE neighbourhoods). This particular use of bisimulation 
supports the definition of summaries that go beyond the traditional parent and child 
hierarchical relationships covered by the abundant literature on summaries. Intuitively, 
nodes that have bisimilar subgraphs "around" them (i.e., neighbourhoods) belong to the 
same extent. For instance, DescribeX can define extents containing only nodes with the 
same set of outgoing label paths matching a given sequence of axes. Neighbourhoods are 
a key mechanism in the declarative definition of DescribeX summaries. 

Figure 3.1: The axis graph of two PSI-MI interactions 

3.1 A regular expression language on axes 

In this section, we introduce the AxPRE language for describing neighbourhoods in an 
SD. For representing an XML instance, DescribeX uses a labeled graph model called an 
axis graph. 



Definition 3.1 (Axis Graph) An axis graph $\mathcal{A} = (Inst, Axes, Label, \lambda)$ is a structure 
where $Inst$ is a set of nodes, $Axes$ is a set of binary relations 
$\{E_1^{\mathcal{A}}, \ldots, E_n^{\mathcal{A}}\}$ in $Inst \times Inst$ and their inverses, $Label$ is a finite set of 
node names, and $\lambda$ is a function that assigns labels in $Label$ to nodes in $Inst$. Edges 
are labeled by axis names. □ 



An axis graph is an abstract representation of the XPath data model [W3C07] 
extended with edges that represent XPath binary relations between elements. It can 
also include additional axes, such as fc (where fc := child::*[1]), ns (where ns := 
following-sibling::*[1]), id-idrefs, or any binary relation that can be expressed in XPath. 
When representing an XML instance, axis graph nodes are labeled by element or attribute 
names (including namespaces). 

Example 3.1 (PSI-MI Axis Graph) Figure 3.1 shows an axis graph for our running 
example, which is a sample of a protein-protein interaction (PPI) dataset in PSI-MI 
format. PSI-MI stands for Proteomics Standards Initiative Molecular Interaction and is 
the de facto model for PPI used by many molecular interaction databases such as 
BioGRID, Human Protein Reference Database (HPRD), and IntAct. The PSI-MI XML 
schema has a large number of optional elements to allow flexibility, with the result that 
PSI-MI data can be very heterogeneous. Since different databases use different fragments 
of the schema, finding common instance patterns and understanding schema usage can 
be challenging [SCKT07]. 

Each interaction consists of an experimentList element with all the experiments in 
which the interaction has been determined, a participantList element with the molecules 
that participate in the interaction and some optional elements like the name of the 
interaction and a reference (xref) to an interaction database. Each participantList contains 
two or more participants, which are the molecules participating in the interaction. A 
participant element contains a description of the molecule, either by reference to an element 
of the interactorList, or directly in an interactor element. In addition, each participant 
contains a list of all the roles it plays in the experiments (e.g., bait, prey, neutral, etc.). 

Note that, for the sake of clarity, we have omitted many edges depicting relations 
that actually exist. For example, the fc (firstchild) relation is included in the c (child) 
relation, so any fc edge is also a c edge. The inverses of each relation are not shown in 
the figure, e.g., for each c relation, a p (parent) relation exists (since $p = c^{-1}$). □ 
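
The axis graph of Definition 3.1 can be sketched as a label map plus a set of labeled edges. The snippet below is only an illustration under stated assumptions: node identifiers are assigned in document order (they do not match the numbering of Figure 3.1), the XML fragment is invented, and only the c, p, fc and ns relations are materialized.

```python
import xml.etree.ElementTree as ET

XML = "<participant><expRoleList><expRole/><expRole/></expRoleList></participant>"


def build_axis_graph(xml_text):
    """Return (labels, edges): labels maps node id to element name,
    edges is a set of (source, axis, target) triples."""
    labels, edges = {}, set()
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        node = counter
        labels[node] = elem.tag
        previous = None
        for child in elem:
            child_id = visit(child)
            edges.add((node, "c", child_id))      # child
            edges.add((child_id, "p", node))      # parent, the inverse of c
            if previous is None:
                edges.add((node, "fc", child_id))     # fc := child::*[1]
            else:
                edges.add((previous, "ns", child_id))  # ns := following-sibling::*[1]
            previous = child_id
        return node

    visit(ET.fromstring(xml_text))
    return labels, edges


labels, edges = build_axis_graph(XML)
```

As noted in Example 3.1, every fc edge produced here is accompanied by a c edge between the same nodes.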

An AxPRE gives a declarative description of a partition of elements in an SD, some- 
thing not provided by any other proposal in the literature. 



^ http://psidev.sourceforge.net/mi/xml/doc/user/ 

^ http://www.thebiogrid.org/ 

^ http://www.hprd.org/ 

^ http://www.ebi.ac.uk/intact/ 

In an axis graph we define paths and label paths as usual. We call a path defined on 
edges an axis path, and the string resulting from the concatenation of its labels is an axis 
label path. 

Definition 3.2 (Axis Path and Axis Label Path) Let $\mathcal{N}$ be a connected subgraph 
of an axis graph $\mathcal{A}$, and $v$, $v_n$ be two nodes in $\mathcal{N}$ such that there is a path 
$p = (v, axis_1, v_1, axis_2, \ldots, axis_n, v_n)$ from $v$ to $v_n$. The axis path of $p$ is the string 
$ap = axis_1.axis_2.\cdots.axis_n$. The axis label path of $p$ is the string 
$\lambda(p) = axis_1[\lambda(v_1)].axis_2[\lambda(v_2)].\cdots.axis_n[\lambda(v_n)]$. □ 



Example 3.2 Consider the axis graph of Figure 3.1. Two of the paths from node 6 to 11 
are $p = (6, c, 8, fc, 9, ns, 11)$ and $p' = (6, c, 8, c, 11)$. Their axis paths are $ap = c.fc.ns$ and 
$ap' = c.c$, respectively. Finally, the axis label paths of $p$ and $p'$ are 
$\lambda(p) = c[expRoleList].fc[expRole].ns[expRole]$ and $\lambda(p') = c[expRoleList].c[expRole]$, respectively. □ 
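
The two strings of Definition 3.2 can be computed mechanically from a path. The sketch below assumes a path is represented as an alternating tuple of nodes and axis names, which is a representation chosen for illustration; the node labels are those used in Example 3.2.

```python
# Labels of the nodes that appear in Example 3.2.
LABELS = {6: "participant", 8: "expRoleList", 9: "expRole", 11: "expRole"}


def axis_path(path):
    """Concatenate the axis names of a path (v, axis_1, v_1, ..., axis_n, v_n)."""
    return ".".join(path[1::2])


def axis_label_path(path, labels):
    """Concatenate axis_i[label(v_i)] for every step of the path."""
    steps = zip(path[1::2], path[2::2])
    return ".".join(f"{axis}[{labels[node]}]" for axis, node in steps)


p = (6, "c", 8, "fc", 9, "ns", 11)
p_prime = (6, "c", 8, "c", 11)
```

Applied to `p` and `p_prime`, the two functions reproduce the axis paths and axis label paths given in Example 3.2.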

Definition 3.3 (Axis Path Regular Expressions) An axis path regular expression 
(AxPRE) is an expression generated by the grammar 

$E \leftarrow axis \mid axis[B(l)] \mid (E \mid E) \mid (E)^* \mid E.E \mid \epsilon \mid [B(l)]$ 

where $axis \in Axes$ and $\epsilon$ is the symbol representing the empty expression. □ 



Definition 3.3 describes the syntax of path regular expressions on the binary relations 
(labeled edges) of the axis graph including node label tests. The function $B(l)$ is a 
boolean function on a label $l \in Label$ that supports elaborate tests beyond just matching 
labels. 

An AxPRE defines a pattern we want to find in an instance. We need a way of 
computing all occurrences of such a pattern in an axis graph; each occurrence will be 
called a neighbourhood. We do this by computing an automaton for the AxPRE, another 
for the axis graph, and then taking the intersection. Finally, a summary will group nodes 
with similar patterns together into an extent (DescribeX uses bisimulation as the notion 
of similarity). 

Figure 3.2: Axis graph fragment from node 8 (a) and its automaton $\mathcal{M}_{\mathcal{A}}(8)$ (b) 



The AxPRE semantics (Definition 3.8) is given by the notion of AxPRE neighbourhood 
of a node (Definition 3.7). In order to compute an AxPRE neighbourhood we first need 
to define an automaton from the axis graph. Such an automaton will have two states 
for each node in the axis graph, one named head and the other tail. In addition, edges 
in the graph will be represented as transitions between tail and head states, and node 
labels as transitions between head and tail states. 



Definition 3.4 (Axis Graph Automaton) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an 
axis graph and $v$ a node in $\mathcal{A}$. The axis graph automaton of $\mathcal{A}$ from $v$, 
$\mathcal{M}_{\mathcal{A}}(v) = (Q, \Sigma, \delta, q_0, F)$, is an automaton [HU79] defined as follows: 

• For each node $w \in Inst$ there is a state $head(w) \in Q$, a state $tail(w) \in Q$ and a 
transition $\delta(head(w), [\lambda(w)]) = tail(w)$; 

• For each edge $(w_i, w_j)$ labeled $axis$ in $\mathcal{A}$ there is a transition 
$\delta(tail(w_i), axis) = head(w_j)$; 

• All $tail(w)$ states in $Q$, $w \in Inst$, are final states in $F$, and $head(v)$ is the initial 
state $q_0$. 

□ 
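
Definition 3.4 translates almost directly into code. In the sketch below, states are represented as `("head", w)` and `("tail", w)` pairs and the transition function as a dictionary from (state, symbol) to a set of target states; both representations are assumptions made for illustration. Sets are used because a node may have several outgoing edges with the same axis. The fragment around node 8 uses the edges that appear in the paths of Example 3.2.

```python
# Fragment around node 8, using the edges that appear in Example 3.2.
LABELS = {8: "expRoleList", 9: "expRole", 11: "expRole"}
EDGES = {(8, "fc", 9), (8, "c", 9), (8, "c", 11), (9, "ns", 11)}


def axis_graph_automaton(labels, edges, start):
    """Build (delta, initial state, final states) following Definition 3.4."""
    delta = {}
    # One head and one tail state per node, linked by a label transition.
    for node, label in labels.items():
        delta[(("head", node), f"[{label}]")] = {("tail", node)}
    # One axis transition from the tail of the source to the head of the target.
    for source, axis, target in edges:
        delta.setdefault((("tail", source), axis), set()).add(("head", target))
    finals = {("tail", node) for node in labels}
    return delta, ("head", start), finals


delta, initial, finals = axis_graph_automaton(LABELS, EDGES, 8)
```

The resulting automaton has `("head", 8)` as its initial state and every tail state as final, as the definition requires.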

Example 3.3 Consider node 8 of our running example. Figure 3.2 shows on the left 
hand side a fragment of the axis graph that contains node 8. The axis graph automaton 
from node 8 (on the right hand side of the figure) has $head(8)$ as initial state and all 
tail states as final. Each node in the axis graph fragment is unfolded into a head and 
a tail state in the automaton and its label is represented by a transition between them. 
Consider node 11 with label expRole that has ns and c incoming edges and an fc outgoing 
edge in the axis graph. In the automaton, 11 is represented by a $head(11)$ state that 
has ns and c incoming transitions and an outgoing transition [expRole] to $tail(11)$. The 
outgoing fc edge is translated into an fc transition from $tail(11)$ to the head state of the 
corresponding node, which is 12. □ 

An automaton can be obtained from an AxPRE following the usual Thompson's 
construction for regular expressions with a minor change to the basis steps to account for 
AxPRE semantics (which require accepting all prefixes of the language). The language 
accepted by the so-called AxPRE automaton thus constructed will always be 
prefix-closed. (A language $L$ is said to be prefix-closed if, given any word $l \in L$, all prefixes of 
$l$ are also in $L$ [HU79].) 

Definition 3.5 (AxPRE Automaton) Let $\alpha$ be an AxPRE. The AxPRE automaton 
of $\alpha$ is an automaton $\mathcal{M}_\alpha$ obtained from $\alpha$ with a modified Thompson's construction 
[HU79] for accepting all prefixes (Figure 3.5), in which only the final states of the basis 
rules are kept as final in the resulting automaton (the inductive rules for concatenation, 
disjunction and Kleene closure do not mark any additional state as final). The transition 
function $\delta(q_\alpha, axis)$ returns the states that can be reached by an axis transition after 
following an arbitrary number (possibly zero) of $\epsilon$ transitions. □ 

Figure 3.5: Intersection automaton $\mathcal{M}_{\mathcal{A}}(8) \cap \mathcal{M}_{[expRoleList].fc.ns^*}$ (a) and resulting AxPRE 
neighbourhood $\mathcal{N}_{[expRoleList].fc.ns^*}(8)$ (b) 

Example 3.4 Consider the AxPRE [expRoleList].fc.ns* and its automaton in Figure 
3.4. The application of rule $axis[l]$ of the modified Thompson's construction creates 
states $q_0$, $q_1$ and the [expRoleList] transition between them. The application of rule $axis$ 
creates $q_2$, $q_3$, $q_4$, $q_5$, and the $[l_1], \ldots, [l_n]$ transitions from $q_2$ to $q_3$ and from $q_4$ to $q_5$ 
(there is one transition $[l_i]$ for each string in $Label$). The final automaton is obtained by 
applying the concatenation and Kleene closure rules. □ 

An automaton for the intersection of two languages can be constructed by taking the 
product of the automata for the two languages [MW95, Yan90]. 

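Definition 3.6 (Intersection Automaton) Let $\mathcal{M}_{\mathcal{A}}(v)$ be the automaton of an axis 
graph $\mathcal{A}$ from a node $v$, and $\mathcal{M}_\alpha$ be the automaton of an AxPRE $\alpha$. The intersection 
automaton $\mathcal{M}_{\mathcal{A}}(v) \cap \mathcal{M}_\alpha$ is an automaton in which states are pairs $(q_{\mathcal{A}}, q_\alpha)$ consisting of 
a state $q_{\mathcal{A}} \in \mathcal{M}_{\mathcal{A}}(v)$ and a state $q_\alpha \in \mathcal{M}_\alpha$, and there is a transition 
$\delta((q_{\mathcal{A}}, q_\alpha), X) = (q'_{\mathcal{A}}, q'_\alpha)$ if there are transitions $\delta(q_{\mathcal{A}}, X) = q'_{\mathcal{A}}$ in $\mathcal{M}_{\mathcal{A}}(v)$ and 
$\delta(q_\alpha, X) = q'_\alpha$ in $\mathcal{M}_\alpha$, where $X$ is either an axis or a label. A state $(q_{\mathcal{A}}, q_\alpha)$ is final 
(initial) if both $q_{\mathcal{A}}$ and $q_\alpha$ are final (initial). □ 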
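
The product construction of Definition 3.6 can be sketched as a breadth-first exploration of reachable state pairs. Two simplifications are assumed here: the AxPRE automaton for [expRoleList].fc.ns* is written by hand and is free of $\epsilon$ transitions (it is not the output of the modified Thompson's construction), and only the two labels of the fragment are enumerated for its label transitions. The state names `q0` to `q3` are local to this sketch.

```python
from collections import deque

# Axis graph automaton from node 8 (edges taken from the paths of Example 3.2).
DELTA_A = {
    (("head", 8), "[expRoleList]"): {("tail", 8)},
    (("head", 9), "[expRole]"): {("tail", 9)},
    (("head", 11), "[expRole]"): {("tail", 11)},
    (("tail", 8), "fc"): {("head", 9)},
    (("tail", 8), "c"): {("head", 9), ("head", 11)},
    (("tail", 9), "ns"): {("head", 11)},
}
FINALS_A = {("tail", 8), ("tail", 9), ("tail", 11)}

# Hand-built, epsilon-free automaton for [expRoleList].fc.ns* (an assumption).
DELTA_B = {
    ("q0", "[expRoleList]"): {"q1"},
    ("q1", "fc"): {"q2"},
    ("q2", "[expRole]"): {"q3"},
    ("q2", "[expRoleList]"): {"q3"},
    ("q3", "ns"): {"q2"},
}
FINALS_B = {"q1", "q3"}


def intersect(delta_a, init_a, finals_a, delta_b, init_b, finals_b):
    """Product automaton restricted to the state pairs reachable from the start."""
    start = (init_a, init_b)
    delta, seen, queue = {}, {start}, deque([start])
    while queue:
        qa, qb = queue.popleft()
        for (source, symbol), targets_a in delta_a.items():
            if source != qa:
                continue
            for ta in targets_a:
                # A product transition needs the same symbol in both automata.
                for tb in delta_b.get((qb, symbol), ()):
                    pair = (ta, tb)
                    delta.setdefault(((qa, qb), symbol), set()).add(pair)
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    finals = {s for s in seen if s[0] in finals_a and s[1] in finals_b}
    return delta, start, finals


product, start, finals = intersect(
    DELTA_A, ("head", 8), FINALS_A, DELTA_B, "q0", FINALS_B
)
```

The c transitions leaving `("tail", 8)` do not survive in the product because the only transition leaving `q1` in the AxPRE automaton is fc.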



The machinery introduced in Definitions 3.4 through 3.6 is required for computing 
AxPRE neighbourhoods of nodes in the axis graph. The neighbourhood of a node $v$ by 
$\alpha$ can be obtained by taking the intersection between the axis graph automaton from $v$ 
and the AxPRE automaton of $\alpha$, and then converting the resulting automaton to an axis 
graph fragment as described in Definition 3.7. 



Definition 3.7 (AxPRE Neighbourhood of a Node) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$
be an axis graph, $v$ a node in $\mathcal{A}$, $\alpha$ an AxPRE, and $\mathcal{M}_{\mathcal{A}}(v) \cap \mathcal{M}_\alpha$ the intersection
automaton of $\mathcal{M}_{\mathcal{A}}(v)$ and $\mathcal{M}_\alpha$. The AxPRE neighbourhood of $v$ by $\alpha$, denoted $\mathcal{N}_\alpha(v)$,
is the subgraph of $\mathcal{A}$ defined as follows:

• For each transition $\delta((head(w), q_\alpha), l) = (tail(w), q'_\alpha)$, where $(tail(w), q'_\alpha)$ is a final
state, there is a node $w$ with label $l$ in $\mathcal{A}$;

• For each transition $\delta((tail(w_i), q_\alpha), axis) = (head(w_j), q'_\alpha)$, where $(tail(w_i), q_\alpha)$ is a
final state, there is an edge $(w_i, w_j)$ labeled axis in $\mathcal{A}$.

□
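The two conditions of Definition 3.7 can be read as a small conversion routine. The Python sketch below assumes that intersection states are encoded as pairs `((kind, node), q)` with `kind` being "head" or "tail"; this encoding, the sample data, and the name `to_fragment` are hypothetical and chosen only to mirror the two bullets above.

```python
def to_fragment(trans, finals):
    """Turn an intersection automaton into (nodes, edges) of a neighbourhood.

    trans: {((kind, node), q): {symbol: set of ((kind, node), q)}}
    finals: set of final states of the intersection automaton.
    """
    nodes, edges = {}, set()
    for (src, q), out in trans.items():
        kind, w = src
        for sym, targets in out.items():
            for dst, q2 in targets:
                dkind, w2 = dst
                if kind == "head" and dkind == "tail" and (dst, q2) in finals:
                    # label transition head(w) -> tail(w): node w with label sym
                    nodes[w] = sym
                elif kind == "tail" and dkind == "head" and (src, q) in finals:
                    # axis transition tail(w) -> head(w2): edge labeled by the axis
                    edges.add((w, sym, w2))
    return nodes, edges

# Hypothetical intersection automaton for node 8 and [expRoleList].fc.ns*.
trans = {
    (("head", 8), "q0"): {"expRoleList": {(("tail", 8), "q1")}},
    (("tail", 8), "q1"): {"fc": {(("head", 9), "q2")}},
    (("head", 9), "q2"): {"expRole": {(("tail", 9), "q3")}},
    (("tail", 9), "q3"): {"ns": {(("head", 11), "q4")}},
    (("head", 11), "q4"): {"expRole": {(("tail", 11), "q5")}},
}
finals = {(("tail", 8), "q1"), (("tail", 9), "q3"), (("tail", 11), "q5")}
nodes, edges = to_fragment(trans, finals)
```

Label transitions become nodes and axis transitions become edges, which is the correspondence the definition spells out.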

Example 3.5 (AxPRE Neighbourhood of a Node) Consider node 8 of our running
example. The intersection automaton $\mathcal{M}_{\mathcal{A}}(8) \cap \mathcal{M}_{[expRoleList].fc.ns^*}$ is depicted in
Figure 3.5 (a). States are labeled by pairs $(q_{\mathcal{A}}, q_\alpha)$, where $q_{\mathcal{A}}$ is a state in automaton
$\mathcal{M}_{\mathcal{A}}(8)$ and $q_\alpha$ is a state in automaton $\mathcal{M}_{[expRoleList].fc.ns^*}$. The intersection has been
computed following Definition 3.6. The figure shows only the states that have some
incoming or outgoing transition. Note that transition c between tail(8) and head(11) is not
part of the intersection because fc is the only outgoing transition from $q_1$ in $\mathcal{M}_\alpha$.

Figure 3.5 (b) shows the AxPRE neighbourhood of node 6, $\mathcal{N}_{[participant].c.fc.ns^*}(6)$,
obtained by converting the intersection automaton to an axis graph fragment as described in
Definition 3.7. Note that transitions from $(head(v), \ldots)$ to $(tail(v), \ldots)$ in the
intersection are node labels in the AxPRE neighbourhood and that transitions from $(tail(v), \ldots)$
to $(head(w), \ldots)$ are edge labels (axes) in the neighbourhood.

Consider now the five [participant].c.fc.ns* neighbourhoods depicted in Figure 3.6.
Neighbourhood (a) matches a prefix of the AxPRE ([participant].c) whereas (b) through
(e) match the entire AxPRE but with a different number of iterations in the Kleene closure
for ns: 1 for (b) and (e), and 0 for (c) and (d). □

Figure 3.6: All [participant].c.fc.ns* neighbourhoods

We formalize next the notion of AxPRE semantics based on AxPRE neighbourhoods. 

Definition 3.8 (AxPRE Semantics) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis graph
and $v$ a node in $\mathcal{A}$. The evaluation of an AxPRE $\alpha$ on $v$ returns the AxPRE neighbourhood
of $v$ by $\alpha$. □



3.2 Neighbourhoods and bisimulation 

AxPRE neighbourhoods allow us to define a notion of similarity between nodes in an axis
graph. The idea underlying DescribeX is that nodes with similar AxPRE neighbourhoods
will be grouped together. In particular, DescribeX uses the familiar concept of labeled
bisimulation applied to AxPRE neighbourhoods, formalized by Definition 3.9.



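Definition 3.9 (Labeled Bisimulation and Bisimilarity) Let $\mathcal{N}_\alpha(v_0)$ and $\mathcal{N}_\beta(w_0)$
be two AxPRE neighbourhoods of an axis graph $\mathcal{A} = (Inst, Axes, Label, \lambda)$, such that
$Axes_\alpha \subseteq Axes$ and $Axes_\beta \subseteq Axes$. A labeled bisimulation between $\mathcal{N}_\alpha(v_0)$ and $\mathcal{N}_\beta(w_0)$
is a symmetric relation $\approx$ such that for all $v \in \mathcal{N}_\alpha(v_0)$, $w \in \mathcal{N}_\beta(w_0)$, $E_i^\alpha \in Axes_\alpha$,
and $E_i^\beta \in Axes_\beta$: if $v \approx w$, then $\lambda(v) = \lambda(w)$; if $v \approx w$ and $(v, v') \in E_i^\alpha$, then
$(w, w') \in E_i^\beta$ and $v' \approx w'$. Two nodes $v \in \mathcal{N}_\alpha(v_0)$, $w \in \mathcal{N}_\beta(w_0)$ are bisimilar, in
notation $v \sim w$, iff there exists a labeled bisimulation $\approx$ between $\mathcal{N}_\alpha(v_0)$ and $\mathcal{N}_\beta(w_0)$ such
that $v \approx w$. Similarly, two neighbourhoods $\mathcal{N}_\alpha(v_0)$ and $\mathcal{N}_\beta(w_0)$ are bisimilar, in notation
$\mathcal{N}_\alpha(v_0) \sim \mathcal{N}_\beta(w_0)$, iff $v_0 \sim w_0$. □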



Definition 3.9 captures outgoing label paths from the nodes. Bisimulation provides
a way of computing a double homomorphism between graphs. The widespread use of
bisimulation in summaries is motivated by its relatively low computational complexity.
The bisimulation contraction of a labelled graph can be computed in time
$O(m \log n)$ (where $m$ is the number of edges and $n$ is the number of nodes in the
graph), as shown in [PT87], or even in linear time for acyclic graphs, as shown in [DPP04].
Using bisimulation also allows us to capture all the existing bisimulation-based proposals
in the literature (Chapter 4).
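Definition 3.9 can be checked directly on small neighbourhoods. The following Python sketch computes the largest labeled bisimulation by naive fixpoint iteration; it is not the Paige-Tarjan algorithm of [PT87] and makes no attempt at the $O(m \log n)$ bound. The encoding (a `labels` dictionary plus a set of `(source, axis, target)` triples) and the name `largest_bisimulation` are assumptions made for illustration. The data follow the neighbourhoods of nodes 6 and 18 of the running example.

```python
def largest_bisimulation(labels, edges):
    """Return the largest labeled bisimulation as a set of node pairs."""
    succ = {v: [] for v in labels}
    for src, axis, dst in edges:
        succ[src].append((axis, dst))

    def ok(v, w, rel):
        # every edge out of v is matched by an equally labeled edge out of w ...
        forth = all(any(a2 == a and (v2, w2) in rel for a2, w2 in succ[w])
                    for a, v2 in succ[v])
        # ... and every edge out of w is matched by one out of v
        back = all(any(a2 == a and (v2, w2) in rel for a2, v2 in succ[v])
                   for a, w2 in succ[w])
        return forth and back

    # start from all pairs with equal labels and discard violating pairs
    rel = {(v, w) for v in labels for w in labels if labels[v] == labels[w]}
    changed = True
    while changed:
        changed = False
        for v, w in list(rel):
            if not ok(v, w, rel):
                rel.discard((v, w))
                changed = True
    return rel

labels = {6: "participant", 7: "interactorRef", 8: "expRoleList",
          9: "expRole", 11: "expRole",
          18: "participant", 19: "interactorRef", 20: "expRoleList",
          21: "expRole"}
# [participant].c.fc.ns* neighbourhoods of nodes 6 and 18
edges_fc_ns = {(6, "c", 7), (6, "c", 8), (8, "fc", 9), (9, "ns", 11),
               (18, "c", 19), (18, "c", 20), (20, "fc", 21)}
# [participant].c* neighbourhoods of nodes 6 and 18
edges_c = {(6, "c", 7), (6, "c", 8), (8, "c", 9), (8, "c", 11),
           (18, "c", 19), (18, "c", 20), (20, "c", 21)}

rel_fc_ns = largest_bisimulation(labels, edges_fc_ns)
rel_c = largest_bisimulation(labels, edges_c)
```

With the first edge set the pair (6, 18) is discarded, while with the second it is kept, so whether two nodes are grouped depends on the AxPRE used to build their neighbourhoods.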

Example 3.6 Let us consider the nodes 6 and 18 in the axis graph of Figure 3.1. Their
[participant].c.fc.ns* neighbourhoods are depicted in Figure 3.6 (b) and (c), respectively.
Based on Definition 3.9 we can define a labeled bisimulation $\approx$ between nodes 7 and 19
because they have the same labels and they do not have outgoing edges. For the same
reasons we have $11 \sim 21$. However, it is not possible to define a labeled bisimulation
between 9 and 21 because, even though they have the same labels, 9 has one outgoing edge
whereas 21 does not. Thus, $9 \not\sim 21$. This prevents us from defining a labeled bisimulation
between 8 and 20 because they each have only one outgoing fc edge, but to nodes 9 and
21, which are not bisimilar. Therefore, $8 \not\sim 20$. Similarly, $6 \not\sim 18$ because they have
edges with the same labels (c) to nodes that are not bisimilar (8 and 20). Consequently,
neighbourhoods (b) and (c) of Figure 3.6 are not bisimilar.

In contrast, let us compare now nodes 6 and 18 but with respect to their [participant].c*
neighbourhoods, which are depicted in Figure 3.7 (b) and (c), respectively. In this case
we can have $9 \sim 21$ and $11 \sim 21$ because all of them are leaves and have the same
label. Therefore, $8 \sim 20$ because the outgoing edges from 8 go to nodes 9 and 11, which
are bisimilar to the target node (21) of the only outgoing edge from 20. Thus, $6 \sim 18$
because they have edges with the same labels (c) to nodes that in this case are bisimilar
($7 \sim 19$ and $8 \sim 20$). Consequently, neighbourhoods (b) and (c) of Figure 3.7 are in fact
bisimilar. □

Figure 3.7: All [participant].c* neighbourhoods



Definition 3.10 (AxPRE Bisimilarity) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$. When two
nodes $v_0$ and $w_0$ in $\mathcal{A}$ have bisimilar neighbourhoods by the same AxPRE $\alpha$, that is
$\mathcal{N}_\alpha(v_0) \sim \mathcal{N}_\alpha(w_0)$, we say that $v_0$ and $w_0$ are AxPRE bisimilar by $\alpha$ or $\alpha$-bisimilar, in
notation $v_0 \sim^\alpha w_0$. □



Example 3.7 Consider again the neighbourhoods in Figure 3.6. Nodes 6 and 18 have
non-bisimilar [participant].c.fc.ns* neighbourhoods and thus $6 \not\sim^\alpha 18$, where AxPRE $\alpha$ =
[participant].c.fc.ns*. However, if we consider now their [participant].c* neighbourhoods,
which are bisimilar, then $6 \sim^{\alpha'} 18$ for AxPRE $\alpha'$ = [participant].c*. □

AxPRE bisimilarity is used for defining partitions of an axis graph. Intuitively, a
so-called AxPRE partition assigns two nodes $v$ and $w$ in an axis graph to the same class if
their AxPRE neighbourhoods by a given $\alpha$ are bisimilar. This is formalized by Definition
3.11.

Definition 3.11 (AxPRE Partition) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis graph
and $\alpha$ an AxPRE. An AxPRE partition of Inst by $\alpha$, denoted $\mathcal{P}_\alpha$, is a set of pairwise
disjoint subsets of Inst whose union is Inst defined as follows: two nodes $v, w \in Inst$
belong to the same set $P_\alpha^i \in \mathcal{P}_\alpha$ iff $v \sim^\alpha w$.

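Definition 3.12 (Positive Classes) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis graph,
$\alpha$ an AxPRE and $P_\alpha^\emptyset = \{v \in Inst \mid \mathcal{N}_\alpha(v) = \emptyset\}$ the class of nodes with empty neighbourhoods in
the AxPRE partition of Inst by $\alpha$. Then, $\mathcal{P}_\alpha^+ = \mathcal{P}_\alpha - \{P_\alpha^\emptyset\}$ is the set of positive classes of
$\mathcal{P}_\alpha$. □

Since all nodes that have an empty AxPRE neighbourhood belong to the same
equivalence class, $\mathcal{P}_\alpha$ and $\mathcal{P}_\alpha^+$ differ in at most one set.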

Example 3.8 The AxPRE partitions by $[l_1], \ldots, [l_n]$, where $l_1, \ldots, l_n$ are the
different node labels that appear in the axis graph, have one positive class each because
each neighbourhood represents a different node label. (Note that the $n$ different positive
classes do not overlap.) Moreover, the union of those $n$ sets (each coming from a different
partition) also constitutes a partition of Inst. In contrast, if we take only a proper subset
of $m$ node labels, $m < n$, the $m$ positive classes of the resulting AxPRE partitions do not
constitute a partition because their union does not contain all nodes in Inst. □

Given an AxPRE, the positive classes plus one additional class for the empty
neighbourhood form a partition. If we have another AxPRE whose positive classes fall
exclusively within this empty neighbourhood class, then these two AxPREs may be used
together to summarize an axis graph. We are interested in sets of AxPREs whose positive
classes define a partition of Inst, which is formalized next.

Definition 3.13 (Positive Partition) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis graph.
A set $A = \{\alpha_1, \ldots, \alpha_n\}$ of AxPREs defines a positive partition of $\mathcal{A}$, denoted $\mathcal{P}_A$, iff
$\bigcup_i \mathcal{P}_{\alpha_i}^+$ is a partition of Inst. □

The intuition behind the notion of positive partition from a set of AxPREs
$A = \{\alpha_1, \ldots, \alpha_n\}$ can be explained as follows. We know, by Definition 3.13, that each $\alpha_i$ in $A$
defines an AxPRE partition which has positive classes and a unique empty neighbourhood
class. In order for the set $A$ to define a positive partition, the empty neighbourhood class
of $\alpha_i$ has to be further partitioned by some $\alpha_j$ in $A$. In other words, when the entire set
$A$ is considered, every node that belongs to the empty neighbourhood of some $\alpha_i$ also
belongs to some positive class of some $\alpha_j$.
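The condition in Definition 3.13 amounts to checking that the positive classes contributed by all AxPREs are pairwise disjoint and together cover Inst. A minimal Python sketch of that check follows; the function name `is_positive_partition` and the toy classes are hypothetical and only illustrate the condition.

```python
def is_positive_partition(inst, positive_classes):
    """positive_classes: {axpre: list of sets of nodes (its positive classes)}."""
    covered = set()
    for classes in positive_classes.values():
        for cls in classes:
            # classes must be non-empty and must not overlap
            if not cls or covered & cls:
                return False
            covered |= cls
    # together the positive classes must contain every node of Inst
    return covered == set(inst)

inst = {1, 2, 3, 4}
all_labels = {"[a]": [{1, 2}], "[b]": [{3}], "[c]": [{4}]}
some_labels = {"[a]": [{1, 2}], "[b]": [{3}]}
overlapping = {"[a]": [{1, 2}], "[x]": [{2, 3, 4}]}

full = is_positive_partition(inst, all_labels)
partial = is_positive_partition(inst, some_labels)
clash = is_positive_partition(inst, overlapping)
```

The second call fails because node 4 falls only in empty neighbourhood classes, which is the situation described for a proper subset of labels; the third fails because two positive classes share node 2.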

Example 3.9 (Positive Partition) Positive partitions play a key role in our framework.
Using them requires a thorough understanding of the semantics of the AxPREs and the
partitions they define. We now discuss some particular cases of our running example of
Figure 3.1.

Let us consider first the AxPRE $\epsilon$, which evaluated on each axis graph node will
produce as many different neighbourhoods as there are different labels in the axis graph (each
neighbourhood containing a single node). Since all nodes with bisimilar neighbourhoods
will belong to the same class, if there are $n$ different labels in the axis graph the $\epsilon$ positive
partition will contain $n$ classes (Figure 3.8 shows below each SD node the sets of the
partition for our running example). The same positive partition can be obtained with the
set of expressions $A = \{[l_1], \ldots, [l_n]\}$, where $l_1, \ldots, l_n$ are all the different node labels that
appear in the axis graph. In our running example, the set of expressions equivalent to $\epsilon$
would contain [interaction], [participant], etc.

Let us consider now the AxPRE [participant]. The partition by [participant] is
obtained as follows: for each node in the axis graph, we compute the AxPRE neighbourhood
corresponding to [participant], and all nodes with bisimilar neighbourhoods (i.e., all
nodes that are [participant]-bisimilar) will belong to the same class. Thus, the partition
will consist of two classes: one containing all the nodes $v$ such that $\lambda(v) = participant$,
which is the set {4, 6, 18, 23, 28} (the positive class), and the other one with the remaining
nodes (the empty neighbourhood class). On the other hand, the [¬participant] partition
will create as many positive classes as there are distinct labels $\lambda(v) \neq participant$ among the
nodes $v$ in Inst. In our running example, the [¬participant] partition will have nine positive
classes (one per label different from "participant") whereas all nodes with "participant" label
will belong to the empty neighbourhood class. The two AxPREs [participant] and
[¬participant], when put together, define a positive partition with ten classes (one for each
label). □

3.3 Describing summaries with AxPREs 

In the previous sections, we have introduced the basic machinery we need to define a
summary descriptor (SD, for short). An SD is defined from an axis graph and a set of
AxPREs. Intuitively, an SD consists of an axis graph in which each node is associated with
an AxPRE and a set in its AxPRE partition, and whose edges represent axis relationships
between those sets.

Definition 3.14 (Summary Descriptor) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis
graph of an instance. A summary descriptor (SD for short) of $\mathcal{A}$ is a structure
$\mathcal{D}_A = (A, \mathcal{G}, axpre, extent)$ that consists of:

• a set $A = \{\alpha_1, \ldots, \alpha_n\}$ of AxPREs such that $\mathcal{P}_A$ is a positive partition of $\mathcal{A}$ by $A$;

• an axis graph $\mathcal{G} = (Sum, Axes^{\mathcal{G}}, Label, \lambda^{\mathcal{G}})$, called SD graph, representing axis
relationships between nodes in the sets (extents) of the positive partition $\mathcal{P}_A$ where:

– Sum is a set of nodes;

– $Axes^{\mathcal{G}}$ is a set of binary relations $\{E_1^{\mathcal{G}}, \ldots, E_n^{\mathcal{G}}\}$ in $Sum \times Sum$ such that there
is a tuple $(s_j, s_k)$ in $E_i^{\mathcal{G}}$ iff $\exists E_i \in Axes, \exists v \in extent(s_j), \exists w \in extent(s_k) \wedge
(v, w) \in E_i$ (edges are labeled by axis names);

– Label is the set of node labels from $\mathcal{A}$;

– $\lambda^{\mathcal{G}}$ is a function that assigns labels in Label to nodes in Sum.

• a bijective function axpre that assigns AxPREs from $A$ to nodes in Sum;

• a bijective function extent that assigns a set from the positive partition $\mathcal{P}_A$ to each
node in Sum (the set assigned is called the extent of the node).

□

An SD has some particular characteristics. The set $A$ uniquely defines the extents
of the SD, and therefore its nodes, for any particular axis graph instance. In other
words, given an axis graph $\mathcal{A}$ and the set $A$ we can create the SD of $\mathcal{A}$ by $A$. On the
other hand, not every set of AxPREs defines a positive partition and thus an SD. A
first distinction is between SDs that are defined by a single AxPRE and those
that have a multi-AxPRE definition. We call the former homogeneous SDs
because all their nodes are defined uniformly. Homogeneous SDs are the most common in
the summary literature (e.g., dataguides [GW97], 1-index [MS99], ToXin [RM01],
A(k)-index [KSBG02], F&B-Index [KBNK02], Skeleton [BCF+05]). SDs defined by multiple
AxPREs are called heterogeneous.

Definition 3.15 (Homogeneous and Heterogeneous SDs) When the extents of all
nodes in an SD $\mathcal{D}$ are defined with the same AxPRE $\alpha$ (i.e., $|A| = 1$), we say that the
corresponding SD is homogeneous. In this case we say that $\mathcal{D}$ is an $\alpha$ SD. In contrast,
if at least two different nodes are defined with different AxPREs (i.e., $|A| > 1$) we have
a heterogeneous SD. □

Proposition 3.1 Let $\mathcal{A}$ be an axis graph and $A$ a set of AxPREs. If each $\alpha_i \in A$
is an AxPRE of the form $[l]$, $l \in Label$, all different from each other, and there
is an AxPRE for each label in $\mathcal{A}$, then $A$ defines a heterogeneous SD. Such an SD is
called a label SD. □

Proof 3.1 It is easy to see that if $A$ contains all the labels in $\mathcal{A}$, each AxPRE $[l]$ will
create a positive class labeled $P_l$ associated to a different SD node $s_l$ such that all nodes
in $\mathcal{A}$ with label $l$ will belong to the extent of $s_l$. Since $A$ contains all the labels in the
document, the set $\mathcal{P} = \bigcup_i \mathcal{P}_{\alpha_i}^+$ is a partition of Inst. □

Note that we need to know the instance in advance in order to define the set $A$
accordingly. However, the label SD can also be defined by the AxPRE $\epsilon$, which makes
the label SD homogeneous and its definition independent of the axis graph. The $\epsilon$ SD
will produce exactly the same equivalence classes as the set $A$ of Proposition 3.1.



Example 3.10 (Summary Descriptor) Figure 3.8 shows a label SD for our running
example. Since there are ten different labels in the axis graph of the instance, there are
ten summary nodes in the label SD. Nodes in the figure are labeled by their AxPREs, so
we are considering a heterogeneous label SD in which $A$ contains an AxPRE per label.
The extent of each node is depicted below it. Edges represent summary axis relations.
For instance, there is an edge from $s_2$ to $s_{10}$ labeled c, because there is a c edge in the
axis graph from node 14 (in the extent of $s_2$) to node 16 (in the extent of $s_{10}$).

There are three kinds of edges in the figure, depending on properties of the sets that
participate in the axis relation: dashed, regular, and bold. Dashed edges, like $(s_2, s_{10})$
with label c, mean that some element in the extent of $s_2$ has a child in the extent of $s_{10}$.
Regular edges, like $(s_6, s_7)$ with label fc, mean that every element in the extent of $s_6$ has
a first child in the extent of $s_7$. (Since c includes fc, we do not draw a c edge when an
fc edge exists.) Finally, bold edges, like $(s_4, s_5)$ with label fc, mean that every element in
the extent of $s_5$ is a first child of some element in the extent of $s_4$ and that every element
in the extent of $s_4$ has a first child in the extent of $s_5$. The nodes and edges in the figure
constitute the SD graph of the label SD.

Figure 3.9 shows another heterogeneous SD with a different set $A$ where [participant],
[expRoleList] and [expRole] from Figure 3.8 have been replaced by [participant].c.fc.ns*,
[expRoleList].fc.ns* and [expRole].ns*, respectively. □



Figure 3.8: Label SD for the PSI-MI samples

Figure 3.9: A refined SD for the PSI-MI samples

Definition 3.16 (Summary Axis Stability) Let $e = (s_i, s_j)$ be an SD graph edge with
label axis. We say that $e$ is an existential edge if $\exists x \in extent(s_i), \exists y \in extent(s_j) \wedge
(x, y) \in axis$, and a forward-stable edge if $\forall x \in extent(s_i), \exists y \in extent(s_j) \wedge (x, y) \in
axis$. □



Definition 3.16 captures the relationship between edges in the SD graph and the
axis graph, and generalizes to several axes the edge stability representation in XSketch
[PG06b]. Note that all forward-stable edges are also existential. In Figures 3.8 and 3.9
existential edges are represented by dashed lines and forward-stable edges by solid lines.
A dashed line does not necessarily mean that an edge is not forward-stable; it might be
that stability has not been checked on that edge (existential edges in Figures 3.8 and 3.9
have been checked and are not forward-stable). When an edge $e$ and its inverse are both
forward-stable, $e$ is shown in bold lines.
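The three drawing styles can be derived mechanically from the extents and the axis relation. The Python sketch below is one possible reading of Definition 3.16 and of the bold-edge convention; the function name `classify_edge`, the returned strings, and the sample extents are assumptions made for the example.

```python
def classify_edge(extent_i, extent_j, axis_pairs):
    """Classify the SD edge (s_i, s_j) for one axis given as a set of (x, y) pairs."""
    related = {(x, y) for (x, y) in axis_pairs
               if x in extent_i and y in extent_j}
    if not related:
        return "none"                      # no SD edge at all
    sources = {x for x, _ in related}
    targets = {y for _, y in related}
    forward = sources == set(extent_i)     # every x has some y
    inverse = targets == set(extent_j)     # every y has some x (inverse edge)
    if forward and inverse:
        return "bold"
    if forward:
        return "forward-stable"
    return "existential"

dashed = classify_edge({2, 14}, {16}, {(14, 16)})
solid = classify_edge({1, 2}, {3, 4, 5}, {(1, 3), (2, 4)})
bold = classify_edge({4, 6}, {5, 7}, {(4, 5), (6, 7)})
missing = classify_edge({1}, {2}, set())
```

The first call mirrors the c edge between the extents containing nodes 14 and 16: only one of the two source nodes participates, so the edge is existential but not forward-stable.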



Algorithm 3.1 computes an SD $\mathcal{D}$ from an axis graph $\mathcal{A}$ and a set $A$ of AxPREs that
define a positive partition of $\mathcal{A}$. Essentially, the algorithm creates the positive partition
in one pass over $\mathcal{A}$ (outer loop spanning steps 2-18). Loop 4-18 computes the AxPRE
neighbourhood of $v$ for each $\alpha$ in $A$ (step 5) in order to find the $\alpha$ for which the AxPRE
neighbourhood of $v$ is non-empty. Since $A$ defines a positive partition as a precondition,
for every $v$ there is one and only one $\alpha$ in $A$ such that $\mathcal{N}_\alpha(v) \neq \emptyset$. This guarantees
that the condition in step 6 is true exactly once for every $v$ in $\mathcal{A}$.

The next task in the algorithm is to find the extent where $v$ belongs. Loop 7-11
compares by bisimulation $\mathcal{N}_\alpha(v)$ with every node in $\mathcal{D}$ that has the same AxPRE $\alpha$. If
there is a node $s$ in $\mathcal{D}$ with $\alpha$ but the $\alpha$ neighbourhoods of $v$ and $s$ are not bisimilar
(step 10), then a new node $s$ is created and $v$ is added to its extent (steps 12-16). The
same happens if there is no $s$ in $\mathcal{D}$ with $\alpha$ at all. Since each $v$ in $\mathcal{A}$ may be in an axis
relationship with nodes in any extent, the final loop 17-18 checks edge existence (for the
input set of axes $Axes^{\mathcal{G}}$) between the node $s$ such that $v \in extent(s)$ and every other
node in $\mathcal{D}$. The result of the algorithm is an SD $\mathcal{D}$ where each $s$ in $\mathcal{D}$ has associated a
set in the positive partition of $\mathcal{A}$ by $A$ and the axes in $Axes^{\mathcal{G}}$ satisfy the conditions in
Definition 3.16.
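The overall shape of this construction can be sketched in Python. This is a simplified illustration rather than the thesis implementation: neighbourhood computation and the bisimilarity test are passed in as callables (`neighbourhood` and `bisimilar`, both hypothetical), one representative neighbourhood is stored per SD node, and existential edges are computed after the pass instead of inside the main loop.

```python
def create_sd(inst, labels, axes, axpres, neighbourhood, bisimilar):
    """Build SD nodes (with extents) and existential SD edges.

    inst: iterable of axis graph nodes; labels: {node: label};
    axes: {axis_name: set of (v, w) pairs}; axpres: list of AxPREs.
    neighbourhood(alpha, v) must return None for an empty neighbourhood.
    """
    sd_nodes = []   # one dict per SD node
    owner = {}      # axis graph node -> index of the SD node holding it
    for v in inst:
        for alpha in axpres:
            nbh = neighbourhood(alpha, v)
            if nbh is None:
                continue
            candidate = None
            for idx, s in enumerate(sd_nodes):
                # compare with the representative neighbourhood of the extent
                if s["axpre"] == alpha and bisimilar(nbh, s["nbh"]):
                    candidate = idx
                    break
            if candidate is None:
                sd_nodes.append({"axpre": alpha, "label": labels[v],
                                 "extent": set(), "nbh": nbh})
                candidate = len(sd_nodes) - 1
            sd_nodes[candidate]["extent"].add(v)
            owner[v] = candidate
            break   # positive partition: exactly one alpha matches v
    # existential edges: some v in one extent is related to some w in another
    sd_edges = {axis: {(owner[v], owner[w]) for v, w in pairs}
                for axis, pairs in axes.items()}
    return sd_nodes, sd_edges

# Toy label SD: the neighbourhood by [l] is just the label, or empty.
labels = {1: "a", 2: "b", 3: "b"}
axes = {"c": {(1, 2), (1, 3)}}
label_nbh = lambda alpha, v: labels[v] if alpha == "[" + labels[v] + "]" else None
sd_nodes, sd_edges = create_sd([1, 2, 3], labels, axes, ["[a]", "[b]"],
                               label_nbh, lambda x, y: x == y)
```

If the supplied AxPREs do not define a positive partition, some node has no owner and the edge step raises a `KeyError`, which matches the precondition stated for the algorithm.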

As shown, loop 2-18 performs $|Inst|$ iterations. At any given moment, there is at
most the same number of nodes in $\mathcal{D}$ as in $\mathcal{A}$ (each extent having only one node) and
all have the same AxPRE. Therefore, loop 7-11 performs $|Inst|$ iterations in the worst
case. Each iteration computes an AxPRE bisimulation (step 10) with time complexity
$O(m \log |Inst|)$, where $m$ is the total number of tuples (edges) in all axes in Axes. The
worst case for loop 17-18 is the same as that of loop 7-11, so it also performs $|Inst|$
iterations. Thus, the total time complexity of Algorithm 3.1 is $O(|Inst| \cdot m \cdot \log |Inst|)$.



The notion of an AxPRE neighbourhood can also be defined for an SD graph, and
it is called the summary AxPRE neighbourhood of a node. Since an SD graph is in fact
an axis graph $\mathcal{G} = (Sum, Axes^{\mathcal{G}}, Label, \lambda^{\mathcal{G}})$, for any given SD node $s$ and AxPRE $\alpha$
we can define its SD graph automaton $\mathcal{M}_{\mathcal{G}}(s)$ (Definition 3.4) and intersect it with the
AxPRE automaton $\mathcal{M}_\alpha$ (Definition 3.5) in order to obtain an AxPRE neighbourhood
(Definition 3.7) of $s$.



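Definition 3.17 (Summary Neighbourhood) Let $\mathcal{D}_A = (A, \mathcal{G}, axpre, extent)$ be
an SD, axis graph $\mathcal{G} = (Sum, Axes^{\mathcal{G}}, Label, \lambda^{\mathcal{G}})$ its SD graph, $s$ a node in $\mathcal{G}$, $\alpha$ an
AxPRE, and $\mathcal{M}_{\mathcal{G}}(s) \cap \mathcal{M}_\alpha$ the intersection automaton of $\mathcal{M}_{\mathcal{G}}(s)$ and $\mathcal{M}_\alpha$. The summary
neighbourhood of $s$ by $\alpha$, denoted $\mathcal{N}_\alpha^{\mathcal{G}}(s)$, is the subgraph of $\mathcal{G}$ as in Definition 3.7. □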



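Definition 3.18 (Partition Refinement) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis
graph. If $\mathcal{P}_A$ and $\mathcal{P}_B$ are positive partitions of $\mathcal{A}$, $\mathcal{P}_A$ is a partition refinement of $\mathcal{P}_B$ if
every set of $\mathcal{P}_A$ is contained in a set of $\mathcal{P}_B$. □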
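Partition refinement is a simple containment test, sketched below in Python. The function name `is_refinement` is hypothetical; the sample classes are the participant and interactorRef extents of the running example, with the remaining classes omitted for brevity.

```python
def is_refinement(fine, coarse):
    """True if every set of `fine` is contained in some set of `coarse`."""
    return all(any(block <= big for big in coarse) for block in fine)

label_partition = [{4, 6, 18, 23, 28}, {5, 7, 19, 24, 29}]
# the participant class split by the number of children
refined = [{4}, {6, 18, 23, 28}, {5, 7, 19, 24, 29}]

refines = is_refinement(refined, label_partition)
coarsens = is_refinement(label_partition, refined)
```

Refinement is not symmetric: the split partition refines the label partition, but not the other way around.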

Definition 3.19 (SD Refinement) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis graph
and $\mathcal{D}_A = (A, \mathcal{G}, extent)$ and $\mathcal{D}_B = (B, \mathcal{G}', extent')$ be two SDs of $\mathcal{A}$. $\mathcal{D}_A$ is an SD
refinement of $\mathcal{D}_B$ if $\mathcal{P}_A$ is a partition refinement of $\mathcal{P}_B$. □

Algorithm 3.1

createSD($\mathcal{A}$, $A$)

Input: An axis graph $\mathcal{A}$, a set $A$ of AxPREs that defines a positive partition of $\mathcal{A}$, and
a set $Axes^{\mathcal{G}}$ of SD axes where each axis contains only the empty tuple
Output: An SD $\mathcal{D}$

 1: create empty SD $\mathcal{D}$
 2: for every $v$ in $\mathcal{A}$ do
 3:   candidate := $\emptyset$
 4:   for every $\alpha$ in $A$ do
 5:     compute the $\alpha$ neighbourhood of $v$: $\mathcal{N}_\alpha(v)$
 6:     if $\mathcal{N}_\alpha(v) \neq \emptyset$ then
 7:       for every node $s$ in $\mathcal{D}$ such that axpre($s$) = $\alpha$ do
 8:         let $w$ be a node in extent($s$)
 9:         compute the $\alpha$ neighbourhood of $w$: $\mathcal{N}_\alpha(w)$
10:         if $v \sim^\alpha w$ (i.e., $\mathcal{N}_\alpha(v) \sim \mathcal{N}_\alpha(w)$) then
11:           candidate := $s$
12:       if candidate = $\emptyset$ then
13:         create a new node candidate in $\mathcal{D}$
14:         axpre(candidate) := $\alpha$
15:         $\lambda^{\mathcal{G}}$(candidate) := $\lambda(v)$
16:       add $v$ to extent(candidate)
17:       for every node $s' \neq$ candidate in $\mathcal{D}$ do
18:         add tuple (candidate, $s'$) and ($s'$, candidate) to the corresponding axis in $Axes^{\mathcal{G}}$ if
            conditions in Definition 3.16 are satisfied

Figure 3.10: [participant].c.fc.ns* neighbourhoods of the Figure 3.8 (a) and Figure 3.9 (b) SDs

Proposition 3.2 Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis graph, $\alpha$ and $\beta$ be AxPREs,
and $\mathcal{P}_\alpha$ and $\mathcal{P}_\beta$ be AxPRE partitions of $\mathcal{A}$. If $\alpha$ is contained in $\beta$ then $\mathcal{P}_\beta$ is a refinement
of $\mathcal{P}_\alpha$. □

Proof 3.2 (sketch) The proof follows from the notion of AxPRE neighbourhoods. If $\alpha$
is contained in $\beta$ then for any given node $v$, its $\alpha$ neighbourhood is contained in its $\beta$
neighbourhood. Consequently, two nodes that are not distinguished by $\alpha$ (i.e., they are
$\alpha$-bisimilar) may be distinguished by $\beta$, but not the other way around. This guarantees
that $\beta$ creates either the same partition as $\alpha$ or a refinement. □

Corollary 3.1 Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis graph and $\mathcal{D}_A = (A, \mathcal{G}, extent)$
and $\mathcal{D}_B = (B, \mathcal{G}', extent')$ be two SDs of $\mathcal{A}$. If every $\beta \in B$ is contained in some $\alpha \in A$
then $\mathcal{D}_A$ is an SD refinement of $\mathcal{D}_B$. □



Example 3.11 (SD Refinement) Let us consider the label SD of Figure 3.8. Recall
that in the label SD, $A = \{[l_1], \ldots, [l_n]\}$, where $l_i \in Label$, $l_i \neq l_j$ for $i \neq j$, and $\bigcup_i l_i = Label$.
Suppose we want to refine node $s_4$. For this node, the partition represented in the figure
was produced by the AxPRE [participant]. If we replace this AxPRE by [participant].c
in $A$, and apply this set of AxPREs to Inst, two nodes will be produced, let us call these
nodes $s_{41}$ and $s_{42}$, with extents {4} and {6,18,23,28}, respectively ($s_4$ will not appear
because the AxPRE which produced it was replaced by the new one). This occurs because
node 4 in the axis graph has one child (namely interactorRef) while the other four nodes
have two children each (interactorRef and expRoleList). Thus, applying [participant].c
we obtain two different AxPRE neighbourhoods, plus the empty neighbourhood, which is
itself partitioned by the remaining AxPREs.

Analogously, if we want to refine the extent of $s_{42}$ further using the AxPRE c.ns,
we will replace the AxPRE [participant].c by [participant].c.c.ns. This will produce three
sets, with extents: {4}, {6,28}, {18,23}.

Finally, suppose now that the label SD is defined using $A = \{\epsilon\}$, and we want to refine
node $s_4$ with [participant].c. In this case, just adding the new AxPRE does not suffice,
because we would not obtain an SD: the union of positive partitions will not be a partition
of Inst because $\epsilon$ will still produce its own partitions. We solve this by adding the AxPRE
[¬participant], which will produce the remainder of the label SD and will send all nodes
labeled participant to the empty neighbourhood class. □

The notions of partition and SD refinement, besides describing the axis structure
of an axis graph, allow us to define a hierarchy of SDs. This provides the basis for
recognizing a lattice among different SDs, where each node corresponds to a different
AxPRE definition. We will show that this lattice covers all the summaries addressed in
the literature, plus more complex new ones. At the top of this hierarchy (i.e., the coarsest
partition), the empty AxPRE defines an SD where each node is partitioned by label (as
shown in Figure 4.1), a typical summary found in the literature [CM94, NUWC97].
The bottom of the lattice may vary, although the finest partition granularity can be
represented by the expression $(fc.ns^*)^*$, which produces a partition in which each node in
the axis graph belongs to a different equivalence class.

Definition 3.20 (DescribeX Lattice) A DescribeX lattice with respect to a set of axes
$A = \{a_1, \ldots, a_n\}$ is defined as follows: each node corresponds to an AxPRE generated by
the grammar of Definition 3.3 when the terminal axis is one of $a_1, \ldots, a_n$. Also, there is
an edge $(n_1, n_2)$ in the lattice if and only if the AxPRE of $n_2$ is contained in the AxPRE
of $n_1$. □



From Definition 3.20 it follows that the coarsest partition that the lattice may define
is the label SD. The finest partition depends on the chosen set of axes.

This chapter provided an overview of the DescribeX framework, including the AxPRE
language and some fundamental notions like neighbourhood, bisimilarity, and summary
descriptor (SD). In the next chapter we will discuss how the DescribeX lattice captures
and generalizes many proposals in the literature.



Chapter 4

Capturing earlier literature proposals with DescribeX

DescribeX summaries can be classified in a lattice that describes a refinement relationship
between entire summaries (Definition 3.20). In this chapter we revisit some of the related
work discussed in Chapter 2 that can be captured in such a lattice by the DescribeX
framework.



Figure 4.1 shows a fragment of a DescribeX summary lattice that captures earlier
proposals based on the notion of bisimilarity (in green) and ad-hoc constructions (in
red). Each node in the figure corresponds to a homogeneous SD defined by an AxPRE.
DescribeX not only captures most summary proposals but also provides a declarative
way of defining entirely new ones: nodes and edges in blue are a sample of the richer
SDs that were never considered in the literature, like the one that appears in Figure 3.9
(c.fc.ns*) and in Chapter 5 ($p^*|c.fs$).



4.1 Bisimilarity-based proposals 

The earliest bisimilarity-based summary proposal is the family presented in [MS99], which
contains a $p^*$ summary: the 1-index. The 1-index partition is computed by using
bisimulation as the equivalence relation. The F&B-Index [KBNK02] is an example of a $(p|c)^*$
SD. The F&B-Index construction uses bisimulation like the 1-index, but applied to the
edges and their inverses in a recursive procedure until a fix-point is reached. With this
construction, the F&B-Index's equivalence classes are computed according to the incoming and
outgoing label paths of the nodes. The same work introduces the F+B-index (a $p^*|c^*$
AxPRE summary constructed by applying bisimulation to the edges and their inverses only
once) and the BPCI(k,j,m) index (a $(p^k|c^j)^m$ AxPRE summary, where $k$ and $j$ control
the lengths of the paths and $m$ the iterations of the bisimulation on the edges and their
inverses). The F+B-index and the F&B-index are BPCI($\infty$, $\infty$, 1) and BPCI($\infty$, $\infty$, $\infty$)
respectively. The A(k)-index [KSBG02] is a $p^k$ AxPRE summary based on k-bisimilarity
(bisimilarity computed for paths of length $k$). Thus, the A(0)-index is a label SD, the
A(1)-index is a p SD, the A(2)-index is a p.p SD, and the A(h)-index is the $p^h$ SD. We
discuss some of these proposals in more detail.

Figure 4.1: AxPRE summary lattice capturing earlier homogeneous proposals

Unlike standard definitions in the bisimulation literature [PT87, DPP04], the 1-index,
A(k)-index, F&B-index, and BPCI(k,j,m) use a bisimulation defined backwards in order
to capture incoming paths to the nodes. We provide next a definition of backwards
bisimulation and bisimilarity for completeness. In the literature, the only axes considered
are c and idref.

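Definition 4.1 (Backwards Bisimulation and Bisimilarity) Let $\mathcal{G}_1$ and $\mathcal{G}_2$ be two
rooted subgraphs of an axis graph $\mathcal{A} = (Inst, Axes, Label, \lambda)$, such that $Axes_{\mathcal{G}_1} \subseteq Axes$
and $Axes_{\mathcal{G}_2} \subseteq Axes$, and let $r_1, r_2 \in Inst$ be the roots of $\mathcal{G}_1$ and $\mathcal{G}_2$ respectively. A
backwards bisimulation between $\mathcal{G}_1$ and $\mathcal{G}_2$ is a symmetric relation $\approx_b$ such that for all
$v \in \mathcal{G}_1$, $w \in \mathcal{G}_2$, $E_i^{\mathcal{G}_1} \in Axes_{\mathcal{G}_1}$, and $E_i^{\mathcal{G}_2} \in Axes_{\mathcal{G}_2}$: if $v \approx_b w$, then $\lambda(v) = \lambda(w)$; if
$v \approx_b w$ and $(v', v) \in E_i^{\mathcal{G}_1}$, then $(w', w) \in E_i^{\mathcal{G}_2}$ and $v' \approx_b w'$. Two nodes $v \in \mathcal{G}_1$, $w \in \mathcal{G}_2$
are backward bisimilar, in notation $v \sim_b w$, iff there exists a backwards bisimulation $\approx_b$
between $\mathcal{G}_1$ and $\mathcal{G}_2$ such that $v \approx_b w$. □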

It is easy to see that the backwards bisimulation is an equivalence relation. The
F&B-Index construction uses backwards bisimulation like the 1-index, but applied to c
and idref edges and their inverses. Algorithm 4.1 computes the equivalence classes for
the F&B-Index according to both incoming and outgoing label paths of the nodes.

Proposition 4.1 Let $\mathcal{G}$ be an axis graph with Axes = {c} (or Axes = {c, idref}). The
F&B-index of $\mathcal{G}$ is a $(p|c)^*$ SD (or a $(p|c|idref|idref^{-1})^*$ SD). □



Proof 4.1 The input data graph used in the F&B-index construction (Algorithm 4-1) 
can be viewed as an axis graph with the c axis, in which the reversed edges of lines 4 and 
6 correspond to the c^^ axis (equivalent to a p axis). Therefore, for simplicity, instead 
of reversing edges we use an axis graph G with Axes = {c} and take its inverse when 
necessary. If id-idrefs are considered, then Axes = {c, idref}. 
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Algorithm 4.1 

F&B—construction{G) 

Input: Data graph G 
Output: F&B-index I 

1: let V he a partition of the nodes in G 

2: V ^- label SD partition of G 

3: repeat 

4: reverse all edges in G 

5: V ^- compute the backwards bisimilarity partition of G initializing the computa- 
tion with V 

6: reverse all edges in G, obtaining the original G 

7: V ^- compute the backwards bisimilarity partition of G initializing the computa- 
tion with V 

8: until V does not change (fix point) 

9: for each equivalence class Pi (^V do 
10: create an index node s E I 
11: extent{s) ^- Pi 
12: for each edge from v to w in G do 
13: let s E I be an index node such that v G extent{s) 
14: let s' E I be an index node such that w G extent{s) 
15: if there is no edge from s to s' then 
16: create an edge from s to s' 
Let us consider first the case of Axes = {c}. We start with the label SD in line 2,
which is an ε SD. Lines 4 and 5 are equivalent to refining all nodes in the initial ε SD by
the c* AxPRE. This produces a c* SD. Then, lines 6 and 7 produce a refinement of all
c* nodes by the p* AxPRE, thus obtaining a c*.p* SD. The iterative process until the fix
point can be represented in our framework as a Kleene closure of the c*.p* AxPRE, which
yields a (c*.p*)* SD. It is easy to see that AxPRE (c*.p*)* produces the same SD as (p|c)*
(by identity of regular expressions). The remainder of the algorithm (lines 9-16) creates
existential edges like in Definition 3.16.

When Axes = {c, idref}, the argument is similar but with AxPREs (c|idref)* and
(p|idref)* instead of c* and p*, respectively. In this case, the final AxPRE for the SD
is (p|c|idref|idref^{-1})*. □
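
As a rough illustration of the fix point in lines 1-8 of Algorithm 4.1, the following Python sketch alternates a backwards bisimilarity refinement on the reversed and on the original edges until the partition stops changing. It is a naive partition refinement, not an optimized implementation; the function names (`refine`, `fb_partition`) and the small instance are assumptions made for this example only.

```python
from collections import defaultdict

def refine(nodes, edges, block):
    """Refine `block` (node -> block id) by backwards bisimilarity until stable.

    Two nodes stay together only if they share a block and the sources of
    their incoming edges cover the same set of blocks.
    """
    preds = defaultdict(set)
    for v, w in edges:
        preds[w].add(v)
    while True:
        ids = {}
        new = {}
        for n in nodes:
            sig = (block[n], frozenset(block[p] for p in preds[n]))
            new[n] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            return new  # no block was split: fix point reached
        block = new

def fb_partition(nodes, labels, edges):
    """Naive version of lines 1-8 of Algorithm 4.1."""
    ids = {}
    block = {n: ids.setdefault(labels[n], len(ids)) for n in nodes}  # label partition
    reversed_edges = [(w, v) for v, w in edges]
    while True:
        size = len(set(block.values()))
        block = refine(nodes, reversed_edges, block)  # lines 4-5
        block = refine(nodes, edges, block)           # lines 6-7
        if len(set(block.values())) == size:          # line 8
            break
    classes = defaultdict(list)
    for n in nodes:
        classes[block[n]].append(n)
    return sorted(sorted(c) for c in classes.values())

# Hypothetical instance: r has children b1, b2, b3; only b1 and b3 have a c child.
nodes = ["r", "b1", "b2", "b3", "c1", "c2"]
labels = {"r": "a", "b1": "b", "b2": "b", "b3": "b", "c1": "c", "c2": "c"}
edges = [("r", "b1"), ("r", "b2"), ("r", "b3"), ("b1", "c1"), ("b3", "c2")]
print(fb_partition(nodes, labels, edges))
```

In this instance all three b nodes share the same incoming label path, so a refinement on incoming paths alone keeps them together; the outgoing paths separate b2 from b1 and b3, which is the extra distinction the (p|c)* description captures.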

The notion of k-bisimilarity used in the A(k)-index was defined to capture incoming
paths on c and idref edges of length up to k. We provide next a more general definition
for axis graphs that supports paths on all types of axes.

Definition 4.2 (Backwards k-Bisimulation and k-Bisimilarity) Let G_1 and G_2 be
two rooted subgraphs of an axis graph A = (Inst, Axes, Label, λ), such that Axes_{G_1} ⊆
Axes and Axes_{G_2} ⊆ Axes, and let r_1, r_2 ∈ Inst be the roots of G_1 and G_2 respectively.
A backwards k-bisimulation between G_1 and G_2 is a symmetric relation ≈^k such that for
all v ∈ G_1, w ∈ G_2, E_i^{G_1} ∈ Axes_{G_1}, and E_i^{G_2} ∈ Axes_{G_2}: if v ≈^0 w, then λ(v) = λ(w); if
v ≈^k w, and (v', v) ∈ E_i^{G_1}, then (w', w) ∈ E_i^{G_2} and v' ≈^{k-1} w'. Two nodes v ∈ G_1, w ∈ G_2
are backward k-bisimilar, in notation v ~^k w, iff there exists a backwards k-bisimulation
≈^k between G_1 and G_2 such that v ≈^k w. □

Note that backwards k-bisimilarity defines an equivalence relation on the nodes in
the axis graph. The partition created by the backwards k-bisimilarity corresponds to
the A(k)-index, where k is a parameter that represents the length of the incoming paths
summarized by the index.
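
The effect of the parameter k can be sketched in Python as follows. This is a minimal illustration under assumptions, not the construction used by the A(k)-index: it starts from the label partition (k = 0) and performs k rounds of refinement by the blocks of the incoming edges. The function name and the sample tree are hypothetical.

```python
def k_bisim_partition(nodes, labels, edges, k):
    """Return node -> block id for backwards k-bisimilarity (naive k-round refinement)."""
    preds = {n: set() for n in nodes}
    for v, w in edges:
        preds[w].add(v)
    ids = {}
    block = {n: ids.setdefault(labels[n], len(ids)) for n in nodes}  # k = 0: same label
    for _ in range(k):
        ids = {}
        new = {}
        for n in nodes:
            # A node keeps its previous block and adds the blocks of its predecessors.
            sig = (block[n], frozenset(block[p] for p in preds[n]))
            new[n] = ids.setdefault(sig, len(ids))
        block = new
    return block

# Hypothetical tree: root -> a1 -> b1 -> d1 and root -> c1 -> b2 -> d2.
nodes = ["root", "a1", "c1", "b1", "b2", "d1", "d2"]
labels = {"root": "r", "a1": "a", "c1": "c", "b1": "b", "b2": "b", "d1": "d", "d2": "d"}
edges = [("root", "a1"), ("root", "c1"), ("a1", "b1"),
         ("c1", "b2"), ("b1", "d1"), ("b2", "d2")]

for k in range(3):
    part = k_bisim_partition(nodes, labels, edges, k)
    print(k, part["b1"] == part["b2"], part["d1"] == part["d2"])
```

With k = 1 the two b nodes are separated (their parents have different labels) while the two d nodes still share a block; only with k = 2 do incoming paths of length two separate d1 from d2.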
Algorithm 4.2

BPCI-construction(G, k_in, k_out, td)

Input: Data graph G, local similarities k_in and k_out, tree depth td
Output: BPCI(k_in, k_out, td) I

 1: let P be a partition of the nodes in G
 2: P ← label SD partition of G
 3: for i = 1 to td do
 4:   reverse all edges in G
 5:   P ← compute the backwards k_in-bisimilarity partition of G initializing the
      computation with P
 6:   reverse all edges in G, obtaining the original G
 7:   P ← compute the backwards k_out-bisimilarity partition of G initializing the
      computation with P
 8: for each equivalence class P_i ∈ P do
 9:   create an index node s ∈ I
10:   extent(s) ← P_i
11: for each edge from v to w in G do
12:   let s ∈ I be an index node such that v ∈ extent(s)
13:   let s' ∈ I be an index node such that w ∈ extent(s')
14:   if there is no edge from s to s' then
15:     create an edge from s to s'
Proposition 4.2 Let G be an axis graph with Axes = {c} (or Axes = {c, idref}). The
A(k)-index of G is a p^k SD (or a (p|idref)^k SD). □

Proof 4.2 Consider an axis graph G with Axes = {c}. Two nodes v, w belong to the
same extent in the p^k SD iff they are p^k-bisimilar. In addition, we know that v ~_{p^k} w
iff there exist neighbourhoods N_{p^k}(v) and N_{p^k}(w) such that v ~ w. This means we can
define a backwards k-bisimulation ≈^k between N_{p^k}(v) and N_{p^k}(w) such that v ≈^k w and
thus v ~^k w. □

The BPCI(k_in, k_out, td)-index is another proposal based on the notion of backwards
k-bisimulation. Algorithm 4.2 constructs a BPCI(k_in, k_out, td)-index. Algorithm 4.2 is
similar to Algorithm 4.1 but uses k_in-bisimilarity for the reversed edges (line 5),
k_out-bisimilarity for the original edges (line 7), and a td number of iterations instead of
a fix point (lines 3 to 7).

Proposition 4.3 Let G be an axis graph with Axes = {c} (or Axes = {c, idref}). The
BPCI(k_in, k_out, td)-index of G is a (p^{k_in}|c^{k_out})^{td} SD (or a
(p^{k_in}|c^{k_out}|idref^{k_out}|(idref^{-1})^{k_in})^{td} SD). □



Proof 4.3 Like for F&B-index construction (Algorithm 4.1) the input data graph G can
be viewed as an axis graph with the c axis, in which the reversed edges correspond to the
c^{-1} (or p) axis. If id-idrefs are considered, then Axes = {c, idref}.

Let us consider first the case of Axes = {c}. Lines 4 and 5 are equivalent to refining
all nodes in the initial ε SD (line 2) by the c^{k_out} AxPRE. This produces a c^{k_out}
SD. Then, lines 6 and 7 produce a refinement of all c^{k_out} nodes by the p^{k_in} AxPRE, thus
obtaining a c^{k_out}.p^{k_in} SD. The iterative process is repeated td times, which is equivalent
to a (c^{k_out}.p^{k_in})^{td} SD. Again, by identity of regular expressions (c^{k_out}.p^{k_in})^{td} is
equivalent to (c^{k_out}|p^{k_in})^{td}. The remainder of the algorithm (lines 8-15) creates existential
edges like in Definition 3.16. When Axes = {c, idref}, the final AxPRE for the SD is
(p^{k_in}|c^{k_out}|idref^{k_out}|(idref^{-1})^{k_in})^{td}. □
The Skeleton summary [BCF+05] clusters together nodes with the same subtree structure,
thus capturing node ordering in subtrees. Skeleton uses an entirely different construction
approach, but its essence can be captured by the (fc.ns*)* AxPRE.

The D(k)-index [QLO03] and M(k)-index [HY04] are heterogeneous SD proposals.
All nodes s_i are described by p^k AxPREs with a different k per s_i. They use different
construction strategies based on dynamic query workloads and local similarity (i.e., the
length of each path depends on its location in the XML instance) to determine the subset
of incoming paths to be summarized.

XSketch [PG06b] manages summaries capturing many (but not all) heterogeneous
SDs along the p and c axes, ranging from the label summary to the F&B-Index. However,
there is no control over the refinements chosen, nor a description of the intermediate
summaries obtained. This makes sense given that the objective of XSketch is to provide
selectivity estimates. As such, its construction algorithm is guided by heuristics to optimize
the space/accuracy trade-off.

4.2 Ad-hoc construction proposals 

Region inclusion graphs (RIGs) [CM94] and representative objects of length 1 (1-RO)
[NUWC97] are label SDs, that is ε SDs (because all their nodes s_i are described by the ε
AxPRE). In general, representative objects are p^k SDs for XML tree instances. Therefore,
the 1-RO is a label SD, the 2-RO is a p SD, the 3-RO is a p.p SD, and the FRO (full
representative object) is the p* SD.

Dataguides [GW97] group instance nodes into sets called target sets according to
the label paths from the root they belong to. The dataguide construction is basically
a nondeterministic-to-deterministic automaton translation. When the data instance is
a tree, the dataguide's target sets are equivalent to the extents in our framework: a
dataguide of an XML tree is a p* SD.
ToXin [RM01] also has a component that can be viewed as a p* SD. ToXin consists
of three index structures: the ToXin schema, the path index, and the value index. The
ToXin schema is defined only for tree instances, and it is equivalent to a p* SD graph.

In this chapter, we discussed how DescribeX uses AxPREs to capture many summary 
proposals in the literature by providing a declarative definition for them for the first time. 
In the next chapter, we will show how SDs can be declaratively updated by means of two 
basic operations, refinement and stabilization applied to neighbourhoods. 



Chapter 5

Describing extents and neighbourhoods



We have seen that several SD nodes can share the same AxPRE α. The reason for this is
that each SD node with the same α corresponds to a different extent in the α partition.
In the first section of this chapter, we provide mechanisms for describing each extent in
the partition based on neighbourhoods, sets of axis label paths, and AxPREs.

The description provided by a node in the SD can be changed by an operation that 
modifies its AxPRE and thus the AxPRE neighbourhood of the nodes in its extent. 
When the new AxPRE partition thus obtained constitutes a refinement of the old one, 
the operation is called an AxPRE refinement. The notion of refinement is tightly related 
to that of stabilization. An edge stabilization determines the partition of an extent into 
two sets based on the participation (total or partial) of the extent nodes in the axis 
relation the edge represents. In the second section of this chapter, we discuss in detail 
our approaches to refinement and stabilization based on AxPREs.
Figure 5.1: The two [interaction].c[participantList].(c|p) neighbourhoods (a) and their
representative neighbourhood (b) from our running example

5.1 Concise descriptions 

Since several SD nodes can share the same AxPRE, we need a mechanism to uniquely
describe each SD node and its extent. The most straightforward way to do that would
be just to list all nodes that belong to the extent (extensional definition). A more concise
description is provided by the α neighbourhood of any node in the extent. Since all nodes
in an extent are bisimilar, any α neighbourhood can be used to find all the other nodes
in the extent by bisimulation.

In order to get the most concise description, we need to find the smallest (in terms
of number of nodes) neighbourhood in the extent of s that is bisimilar to all the others.
We can do this by computing a bisimulation contraction over all neighbourhoods in the
extent of s. The bisimulation contraction of a given graph is the smallest graph that is
bisimilar to it, which can be computed in time O(m log n) (where m is the number of
edges and n is the number of nodes) [PT87], or even linearly for acyclic graphs [DPP04].
Based on bisimulation contraction we define the notion of representative neighbourhood.

Definition 5.1 (Representative Neighbourhood) Let D be an SD and s a node in
D such that axpre(s) = α. The representative neighbourhood of s for α, denoted R_α(s),
is an axis graph that is the bisimulation contraction of all neighbourhoods N_α(v_i), where
v_i ∈ extent(s). R_α(s) has a single root node v_0 that is bisimilar to all v_i ∈ extent(s). □
Note that the bisimulation contraction is not necessarily one of the neighbourhoods in
the extent - it could be smaller than any of them. Rather, a representative neighbourhood 
is an entirely new axis graph that happens to be the smallest that is bisimilar to all 
neighbourhoods in an extent. 

Example 5.1 (Representative Neighbourhood) Consider the AxPRE partition of
our running example described by AxPRE [interaction].c[participantList].(c|p). It has
only one set containing nodes 2 and 14, whose neighbourhoods are shown in Figure 5.1 (a).
Its representative neighbourhood R_{[interaction].c[participantList].(c|p)}(s) is the graph shown in
Figure 5.1 (b). Note that such a neighbourhood does not belong to the extent of s (there
is no participantList in the axis graph with only one participant node). □

For some neighbourhoods, deciding bisimilarity is equivalent to comparing the sets 
of simple label paths from their roots to their leaves. (A path is simple when it has no 
repeated edges.) In those cases, neighbourhoods can be described by an extent expression 
(EE for short), which is capable of computing precisely the set of elements in the extent of 
a given SD node and functions like a virtual view. In Chapter 6 we provide a mechanism
for expressing EEs in XPath.

Definition 5.2 (Path and LPath Sets) Let N be a neighbourhood in an axis graph
A, and v a node in N. We denote by Path(v) and LPath(v) the set of simple axis paths
and simple axis label paths from v, respectively. □



Example 5.2 Consider the neighbourhoods of Figure 5.1 (a). The Path and LPath sets
are defined as follows: Path(2) = {c, c.c, c.p} = Path(14), and LPath(2) =
{c[participantList], c[participantList].c[participant], c[participantList].p[interaction]} = LPath(14).
Note that both sets include all the prefixes. □
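
The LPath sets of this example can be reproduced with a short Python sketch. It enumerates simple axis label paths (no repeated edges) from a root node. The function name, the edge representation, and the identifiers assumed for the participantList nodes (3 and 15) are hypothetical; only the resulting label paths are taken from the example.

```python
def lpaths(start, edges, labels):
    """Return the set of simple axis label paths from `start`.

    edges maps a node to a list of (axis, target) pairs; a path is simple
    when it does not traverse the same edge twice.
    """
    result = set()

    def walk(node, used, prefix):
        for axis, target in edges.get(node, []):
            edge = (node, axis, target)
            if edge in used:
                continue  # repeated edge: the path would not be simple
            step = f"{axis}[{labels[target]}]"
            path = step if not prefix else f"{prefix}.{step}"
            result.add(path)
            walk(target, used | {edge}, path)

    walk(start, frozenset(), "")
    return result

# Assumed encoding of the two neighbourhoods of Figure 5.1 (a).
labels = {2: "interaction", 3: "participantList", 4: "participant", 6: "participant",
          14: "interaction", 15: "participantList",
          18: "participant", 23: "participant", 28: "participant"}
edges = {
    2: [("c", 3)], 3: [("c", 4), ("c", 6), ("p", 2)],
    14: [("c", 15)], 15: [("c", 18), ("c", 23), ("c", 28), ("p", 14)],
}

print(sorted(lpaths(2, edges, labels)))
print(lpaths(2, edges, labels) == lpaths(14, edges, labels))
```

Both roots yield the same three label paths even though the neighbourhoods contain a different number of participant nodes, which is why a set of label paths can stand in for the whole extent.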

If deciding bisimilarity between a given set of neighbourhoods is equivalent to com- 
paring their LPath sets, we say that such neighbourhoods are LPath distinguishable. 
Figure 5.2: Two [expRoleList].c.f[expRole] neighbourhoods from our running example

Definition 5.3 (LPath Distinguishable) Let N_1(v_1), ..., N_m(v_m) be neighbourhoods
in an axis graph A. We say that N_1, ..., N_m are LPath distinguishable when, for all
1 ≤ i, j ≤ m: N_i(v_i) ~ N_j(v_j) iff LPath(v_i) = LPath(v_j). □

Although the axis graph neighbourhoods we have considered so far are all LPath 
distinguishable, some combination of axes may produce neighbourhoods that are not, as 
illustrated by the next example. 

Example 5.3 (LPath Distinguishable) Consider the two acyclic neighbourhoods of
Figure 5.2, which correspond to nodes 25 and 30 in Figure 3.1, respectively. Both
neighbourhoods have the same LPath set {c[expRole], c[expRole].f[expRole]}. However, it is
easy to see they are not bisimilar: node 33 in neighbourhood (b) has c and f incoming
edges, whereas all expRole nodes in neighbourhood (a) have either a c or an f edge, but
not both. Thus, they are not LPath distinguishable.

In contrast, the three cyclic neighbourhoods of Figure 5.1 are all bisimilar and have the
same LPath set {c[participantList], c[participantList].c[participant],
c[participantList].p[interaction]}. Therefore, they are all LPath distinguishable. □

We are interested in LPath distinguishable neighbourhoods because they can be de- 
scribed by EEs. In general, determining whether a given set of neighbourhoods is LPath 
distinguishable entails computing the bisimulation between them and then comparing 
the result to their LPath sets. 
There is a class of neighbourhoods, however, that are guaranteed to be always LPath
distinguishable. For neighbourhoods in that class, we can bypass the bisimulation
computation and obtain the EEs directly from the LPath sets. Such is the class of the tree
neighbourhoods. How to characterize other classes of LPath distinguishable neighbourhoods
without resorting to bisimulation remains an open problem.

We will show below that tree neighbourhoods are in fact LPath distinguishable
(Proposition 5.1). In order to do that, we need first some auxiliary results.



Lemma 5.1 If two neighbourhoods N_1 and N_2 are bisimilar then there exists a labeled
bisimulation ≈ such that every node in both graphs is in ≈. □

Proof 5.1 By definition, in order for N_1 and N_2 to be bisimilar r_1 and r_2 have to be
bisimilar, where r_1 and r_2 are the roots of N_1 and N_2 respectively. Thus, there has to
be a labeled bisimulation ≈ such that r_1 ≈ r_2. In addition, all nodes in N_1 connected
to r_1 by an edge with label a have to be in the labeled bisimulation with all nodes in N_2
connected to r_2 by an edge with label a (also by definition). This means that every node
connected to either r_1 or r_2 by an edge has to belong to ≈. Since every node in N_1 and
N_2 is reachable from r_1 and r_2 respectively, we can prove inductively that every node in
both N_1 and N_2 belongs to ≈. □

Corollary 5.1 For all leaves v ∈ N_1 and w ∈ N_2: v ~ w iff λ(v) = λ(w). □



Proof 5.2 By Definition 3.S, if v ~ w then there exists a labeled bisimulation ≈ between
N_1 and N_2 such that v ≈ w, which means that λ(v) = λ(w). We need to prove now
that leaves having the same label are bisimilar. It is easy to see from Definition 3.S that
there always exists a labeled bisimulation between leaves in N_1 and N_2 when they have
the same labels. Consequently, if λ(v) = λ(w) then v ~ w. □

Proposition 5.1 Let N_1 and N_2 be tree neighbourhoods in an axis graph A. Then,
N_1(v) ~ N_2(w) iff LPath(v) = LPath(w). □
Proof 5.3 We proceed by induction on the length of an arbitrary outgoing path. For the
base case, we have that v and w are leaves of N_1 and N_2 respectively. By Corollary 5.1,
v ~ w iff λ(v) = λ(w). Since they are leaves, LPath(v) = LPath(w) = ∅, so v ~ w iff
λ(v) = λ(w) and LPath(v) = LPath(w).

For the induction step, consider nodes v ∈ N_1 and w ∈ N_2 and all edges from them
with label "axis": (v, v_i), 1 ≤ i ≤ n and (w, w_j), 1 ≤ j ≤ m. We know that, if there is
a v_k that is not bisimilar to any w_j, i.e., v_k ≁ w_1, ..., v_k ≁ w_m, then by Definition 3.9
v ≁ w. We need to prove that the latter statement is equivalent to the following: if there
is a v_k whose label (or LPath set) is different from the label (or LPath set) of every w_j,
then LPath(v) ≠ LPath(w).

By inductive hypothesis, v_k ≁ w_j iff λ(v_k) ≠ λ(w_j) or LPath(v_k) ≠ LPath(w_j). Note
that edges (v, v_i) and (w, w_j) add prefixes "axis[λ(v_i)]" and "axis[λ(w_j)]" to each string
in LPath(v_i) and LPath(w_j) respectively. For a given node v, let us call preLPath(v) the
set of strings in LPath(v) prefixed with "axis[λ(v)]". It is easy to see that, given any two
nodes v_k and w_l, if the original sets of strings are different (LPath(v_k) ≠ LPath(w_l)), then
the strings with the prefixes are going to be different (preLPath(v_k) ≠ preLPath(w_l)), no
matter what the prefixes are. In addition, if λ(v_k) ≠ λ(w_l), we have that preLPath(v_k) ≠
preLPath(w_l) even when LPath(v_k) = LPath(w_l) (because the labels of the nodes are
included in the prefixes).

Since LPath(v) contains all label paths from v, in particular it contains all those
that begin with "axis" (∪_i preLPath(v_i) ⊆ LPath(v)). Similarly, ∪_j preLPath(w_j) ⊆
LPath(w). However, if there is a v_k such that either its label or its LPath set is different
from those of every w_j, then ∪_i preLPath(v_i) ≠ ∪_j preLPath(w_j). Since all label paths
in LPath(v) that are not in ∪_i preLPath(v_i) begin with a prefix different from "axis",
we conclude ∪_i preLPath(v_i) ≠ ∪_j preLPath(w_j) ⟹ LPath(v) ≠ LPath(w). □
Notation. Let s be a node in an SD D whose extent contains only LPath distinguishable
neighbourhoods. We denote by Path(s) and LPath(s) the set of all different axis
paths and axis label paths, respectively, from the nodes in the extent of s. That is,
LPath(s) = ∪_i LPath(v_i), v_i ∈ extent(s).

When dealing with LPath distinguishable neighbourhoods, the LPath set can be an
alternative way of representing an extent: just compute the representative neighbourhood
R_α(s) of a given SD node s and then take LPath(s). However, checking containment
and equivalence from the LPath sets is cumbersome, so we would like to have a way
of obtaining an AxPRE from an LPath set that provides a concise description of the
representative neighbourhood and thus of all nodes in a given extent. We will call
this new expression an extent AxPRE.

Definition 5.4 (Extent AxPRE) Let D be an SD, s a node in D and α its AxPRE.
An extent AxPRE α' of s is an AxPRE such that all nodes in the extent of s have α'
neighbourhoods and α' is different from all other extent AxPREs in D. □

It is important to note that extent AxPREs can only be defined when representative
neighbourhoods are not pairwise in an inclusion relationship. Because of the prefix
semantics we use, if for any two representative neighbourhoods R_α and R'_α we have that
R_α ⊆ R'_α then any possible AxPRE for R'_α will also return R_α, and consequently it will
not be an "extent" AxPRE.

Figure 5.3: Two [participant].c.fc.ns* neighbourhoods (a) and their representative
neighbourhood (b) from our running example

The extent AxPRE of an SD node s can be constructed from the representative
neighbourhood R_α(s) by taking the label of the root v of R_α(s) and concatenating it
with the disjunction of the axis label paths of v. That is, the extent AxPRE α' of s is
one of the following:

• [λ(v)].(lp_1|lp_2| ... |lp_n) if LPath(v) = ∪_i lp_i

• [λ(v)].lp if LPath(v) = {lp}

• [λ(v)] if LPath(v) = ∅

It is easy to see from the construction that all nodes in the extent of s will have α'
neighbourhoods.
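
The three cases of this construction can be sketched as a small Python function. This is only an illustration under assumptions: the function name is hypothetical, and the label paths are sorted merely to make the output deterministic (the order of the disjuncts is not prescribed by the construction).

```python
def extent_axpre(root_label, lpath_set):
    """Build an extent AxPRE string from a root label and an LPath set."""
    paths = sorted(lpath_set)
    if not paths:                       # LPath(v) is empty
        return f"[{root_label}]"
    if len(paths) == 1:                 # LPath(v) = {lp}
        return f"[{root_label}].{paths[0]}"
    return f"[{root_label}].(" + "|".join(paths) + ")"  # disjunction of label paths

# LPath set of a participant neighbourhood with four label paths.
lpath_participant = {
    "c[interactorRef]",
    "c[expRoleList]",
    "c[expRoleList].fc[expRole]",
    "c[expRoleList].fc[expRole].ns[expRole]",
}
print(extent_axpre("participant", lpath_participant))
print(extent_axpre("participant", {"c[interactorRef]"}))
print(extent_axpre("participant", set()))
```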



Example 5.4 (Extent AxPRE) Consider the two neighbourhoods of Figure 5.3 (a)
from our running example. They are tree [participant].c.fc.ns* neighbourhoods of elements
6 and 28, respectively. In this case, the bisimulation contraction of both neighbourhoods
is an axis graph isomorphic to them and appears in Figure 5.3 (b). Since the label of
both nodes 6 and 28 is participant, the extent AxPRE begins with the prefix [participant].
In addition, LPath(6) = {c[interactorRef], c[expRoleList], c[expRoleList].fc[expRole],
c[expRoleList].fc[expRole].ns[expRole]} = LPath(28), which means that the AxPRE
contains a disjunction of four subAxPREs, resulting in
[participant].(c[interactorRef]|c[expRoleList]|c[expRoleList].fc[expRole]|c[expRoleList].fc[expRole].ns[expRole]). □



According to Definition 3.16 (summary axis stability), forward-stable edges provide
stronger information on the axis relationship that nodes in their extents satisfy: from a
forward-stable edge (s_i, s_j) labeled axis, we know that all nodes in the extent of s_i are
related by axis to some nodes in the extent of s_j. Thus, we are particularly interested in
neighbourhoods in which all edges are forward-stable for their descriptive capabilities.
Figure 5.4: The c.fc.ns* neighbourhood of node s_2 of Figure 3.8



Definition 5.5 (Forward-stable Neighbourhood) A forward-stable neighbourhood
of an SD node s is a neighbourhood of s with all its edges forward-stable. □

An AxPRE always describes some neighbourhood in an axis graph, either of an instance
or an SD. When an AxPRE describes a forward-stable neighbourhood in the SD
graph, it is called a neighbourhood AxPRE. If all edges in the α neighbourhood of SD
node s are forward-stable, the extent AxPRE of s can be computed from them rather
than from the axis graph of the instance.



Example 5.5 (Neighbourhood AxPRE) Consider node s_2 in Figure 5.4. Its current
AxPRE is [interaction], which means that its extent contains only interaction elements.
We can infer from the SD graph a neighbourhood AxPRE as follows. Since edges (s_2, s_3),
(s_2, s_9), and (s_3, s_4) are forward-stable, we could write an AxPRE that expresses those
relations, which is [interaction].(c[participantList].fc[participant]|c[experimentList]). Such
an AxPRE tells us not only that the extent of s_2 contains interaction elements, but more
precisely that they also have nested elements such as a participantList with a nested
participant, and an experimentList. □
5.2 Refinement and stabilization

The description provided by a node in the SD can be changed by an operation that modifies
its AxPRE and thus its AxPRE neighbourhood. This operation is called a refinement
of an SD node. The refinement of an SD node can be computed directly by changing
the AxPRE of the node (Algorithm 5.1) or by stabilizing a summary neighbourhood for
a given AxPRE (Algorithm 5.5). Note that Algorithm 5.1 in fact changes one of the
AxPREs in the definition of the SD, so all nodes that share the modified AxPRE will be
affected.

Previous proposals perform global refinements on the entire SD graph [KBNK02,
KSBG02] or local refinements based on statistics or workload [QLO03, HY04, PG06b],
without the ability to refine a declaratively defined neighbourhood. In contrast, using
DescribeX we can precisely characterize the neighbourhood considered for the refinement
with an AxPRE.

DescribeX refinements can also be based on the notion of summary axis stability
(Definition 3.16). The goal of this particular refinement operation is to make all edges
of a neighbourhood, given by an AxPRE in the SD graph, forward-stable. Edges can
be stabilized one at a time or by groups with the same axis. For the former approach,
DescribeX implements two different strategies. If the edge links two different nodes, then
Algorithm 5.2 is invoked. In contrast, if the edge forms a loop, then Algorithm 5.4 is used.
For stabilizing a group of edges with the same axis from a given node, DescribeX invokes
Algorithm 5.3. All algorithms mentioned above reduce edge stabilization to refinement:
step 1 in each algorithm composes a new AxPRE and step 3 refines the affected nodes
by calling Algorithm 5.1.

The next two examples illustrate how a non forward-stable edge is stabilized by
Algorithms 5.2 and 5.4, respectively.
Algorithm 5.1

refineNode(D, s, α)

Input: An SD D, a node s in D, and an AxPRE α ⊆ axpre(s)
Output: An SD D where s has been refined by α

 1: for every v in extent(s) do
 2:   candidate := ∅
 3:   compute the α neighbourhood of v: N_α(v)
 4:   for every node s' in D such that axpre(s') = α do
 5:     let w be a node in extent(s')
 6:     compute the α neighbourhood of w: N_α(w)
 7:     if v ~_α w (i.e., N_α(v) ~ N_α(w)) then
 8:       candidate := s'
 9:   if candidate = ∅ then
10:     create a new node candidate in D
11:     axpre(candidate) := α
12:     λ_D(candidate) := λ(v)
13:   move v from extent(s) to extent(candidate)
14: let S be the set of nodes connected to s
15: for every node s' in S do
16:   add edges (candidate, s') and (s', candidate) if conditions in Definition 3.16 are
      satisfied
17: delete s and all its incoming and outgoing edges from D
Algorithm 5.2

stabilizeEdge(D, s_i, s_j)

Input: An SD D containing a non forward-stable edge e = (s_i, s_j) with label axis
Output: An SD D where e has been replaced by forward-stable e' = (s_i', s_j)

 1: α := axpre(s_i)|axis axpre(s_j)
 2: for every node s in D such that axpre(s) = axpre(s_i) do
 3:   refineNode(D, s, α)

Algorithm 5.3

stabilizeAxis(D, s_i, axis)

Input: An SD D containing a non forward-stable edge from s_i with label axis
Output: An SD D where all axis edges from s_i are forward-stable

 1: α := axpre(s_i)|axis
 2: for every node s in D such that axpre(s) = axpre(s_i) do
 3:   refineNode(D, s, α)

Algorithm 5.4

unfoldEdge(D, s_i, axis)

Input: An SD D, a node s_i such that there exists a non forward-stable e = (s_i, s_i) with
label axis
Output: The SD D where any edge e = (s_i, s_i) with label axis is forward-stable

 1: α := axpre(s_i)|axis*
 2: for every node s in D such that axpre(s) = axpre(s_i) do
 3:   refineNode(D, s, α)
Figure 5.5: The fc|c|ns neighbourhood of s_4 from Figure 3.8 (a) before stabilizing c edge
to s_6, (b) after stabilization



Example 5.6 (Edge Stabilization) Consider edge (s_4, s_6) from Figure 5.5 (a). This
edge is not forward-stable because element 4 is not related to any node in extent(s_6) via
the c axis (i.e. there is no c edge from 4 to an expRoleList element in Figure 3.1). Edge
stabilization (Algorithm 5.2) creates two nodes, s_41 and s_42, such that extent(s_41) = {4}
and extent(s_42) = {6, 18, 23, 28}. Since axpre(s_4) = [participant] and axpre(s_6) =
[expRoleList] (the original AxPREs), then line 1 of Algorithm 5.2 creates the new AxPRE
[participant]|c[expRoleList], which will be used to refine all nodes with [participant]
AxPRE (lines 2 and 3). The new edge (s_41, s_6) is forward-stable. The result of stabilizing
edge (s_4, s_6) is shown in Figure 5.5 (b). □
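
The effect of this stabilization on the extent can be sketched in Python: the extent of the source node is split according to whether each element participates in the axis relation with some element of the target extent. The sketch does not go through AxPRE refinement as Algorithm 5.2 does; it only reproduces the resulting split. The function names are hypothetical, and the identifiers of the expRoleList elements and of the c edges are assumptions made for illustration.

```python
def split_extent(extent_i, extent_j, axis_edges):
    """Split extent_i into (elements with an axis edge into extent_j, the rest)."""
    targets = set(extent_j)
    related = {v for v in extent_i
               if any(u == v and w in targets for (u, w) in axis_edges)}
    return related, set(extent_i) - related

def is_forward_stable(extent_i, extent_j, axis_edges):
    """Every element of extent_i is related by the axis to some element of extent_j."""
    _, rest = split_extent(extent_i, extent_j, axis_edges)
    return not rest

participants = {4, 6, 18, 23, 28}            # extent of the [participant] node
exp_role_lists = {8, 20, 25, 30}             # assumed extent of the [expRoleList] node
c_edges = {(6, 8), (18, 20), (23, 25), (28, 30)}  # assumed c edges; element 4 has none

with_list, without_list = split_extent(participants, exp_role_lists, c_edges)
print(sorted(with_list), sorted(without_list))
print(is_forward_stable(participants, exp_role_lists, c_edges))
print(is_forward_stable(with_list, exp_role_lists, c_edges))
```

The two sets produced by the split correspond to the two extents reported in the example, {6, 18, 23, 28} and {4}.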



Example 5.7 (Edge Unfolding) Consider the ns loop on node s_42 from Figure 5.5
(b). The edge is not forward-stable because some element in extent(s_42) is not in an ns
relation with elements in the same extent (for instance, there is no element that is the next
sibling of 28 in Figure 3.1). Since axpre(s_42) = [participant]|c[expRoleList] (the result
of the stabilization performed in Example 5.6), then line 1 of Algorithm 5.4 creates the
new AxPRE [participant]|c[expRoleList]|ns*, which will be used to refine all nodes with
[participant]|c[expRoleList] AxPRE (lines 2 and 3). The new edges are forward-stable.
The result of unfolding the ns loop on s_42 is shown in Figure 5.6. □
Figure 5.6: The neighbourhood from Figure 5.5 (b) after stabilizing ns loop on s_42



Algorithm 5.5

StabilizeNeighbourhood(D, α, s)

Input: An SD D, an AxPRE α, and a node s
Output: An SD D where all the edges in the α neighbourhood of s are forward-stable

 1: compute the α neighbourhood of s
 2: S = {s' | s' is in the α neighbourhood of s}
 3: while S ≠ ∅ do
 4:   pick a node s' in S such that s' is at the end of the longest simple path from s
 5:   for each edge e = (s', s') do
 6:     unfoldEdge(D, s', axis)
 7:   for each edge e = (s'', s') do
 8:     stabilizeEdge(D, s'', s')
 9:   remove s' from S
We have now all the building blocks for introducing the neighbourhood stabilization
algorithm, Algorithm 5.5, which computes a refinement of the extent of an SD node s
for an AxPRE α that results in a stable α neighbourhood of s. Given an SD node s and
an AxPRE α, Algorithm 5.5 computes an AxPRE partition of the extent of s for α that
is a refinement of the extent of s. This is achieved by stabilizing the α neighbourhood of
s. In order to stabilize a single edge, Algorithm 5.5 invokes Algorithm 5.2, for different
nodes, and Algorithm 5.4, for the same node (loop). Algorithm 5.3 is a variation of
Algorithm 5.2 in which all edges labeled with the same axis are stabilized.

Most of the execution of the neighbourhood stabilization algorithm is covered by
Examples 5.6 and 5.7. For instance, if we want to stabilize the [participant].(c|fc|ns*)
neighbourhood of node s_4 in Figure 5.5 (a), then Algorithm 5.5 stabilizes edge (s_4, s_6), as
described in Example 5.6, and unfolds edge (s_4, s_4) labeled ns, as described in Example
5.7. The resulting stable [participant].(c|fc|ns*) neighbourhood is shown in Figure 5.6.



In this chapter, we discussed how an SD description can be changed by operations 
that modify its AxPREs and thus the AxPRE neighbourhoods of the nodes in their 
extents. We introduced the two basic DescribeX operations, AxPRE refinement and 
stabilization, and provided algorithms for them. We also gave, for LPath distinguishable 
neighbourhoods, a characterization of the extent of an SD node with an EE. In the next 
chapter, we discuss the XPath syntax and data model, together with a novel mechanism 
for expressing EEs in XPath. 



Chapter 6

Changing descriptions with XPath



We have discussed how to characterize SD nodes and their extents using different ap- 
proaches based on neighbourhoods, sets of axis label paths, and AxPREs. In this chap- 
ter, we propose a novel mechanism to characterize an SD node with an XPath expression 
[W3C07] whose evaluation returns exactly the elements in the extent. This expression,
which effectively represents the extent of a given SD node s, is called extent expression 
(EE) and is denoted ee(s). 

In DescribeX, the extents of any SD node can be precomputed and stored in a data 
structure. This approach, which we call materialized extents, uses a pointer to every 
XML element in the collection and therefore it can be very space consuming. Since the 
evaluation of an ee{s) of a node s returns the actual extents of s, a more space-efficient 
approach is to keep only the EEs. These virtual extents are a compact representation of 
the extents, similar to the concept of virtual views. 
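
The difference between materialized and virtual extents can be illustrated with a minimal Python sketch based on the standard library's ElementTree module, which supports a limited XPath subset. The document fragment and the expression used as an EE are assumptions made for this example; they are not the EEs that DescribeX generates.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two participant elements, only one with an expRoleList child.
doc = ET.fromstring(
    "<interaction><participantList>"
    "<participant><interactorRef/></participant>"
    "<participant><interactorRef/><expRoleList><expRole/></expRoleList></participant>"
    "</participantList></interaction>"
)

# Materialized extent: pointers to the elements are computed once and stored.
materialized = [p for p in doc.iter("participant") if p.find("expRoleList") is not None]

# Virtual extent: only the expression is stored and it is evaluated on demand.
ee = ".//participant[expRoleList]"  # assumed EE, written in ElementTree's XPath subset
virtual = doc.findall(ee)

print(len(materialized), virtual == materialized)
```

Storing the string `ee` takes constant space regardless of the size of the collection, at the price of evaluating the expression whenever the extent is needed.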

Since EEs are expressed in XPath, we give first an introduction to the XPath syntax 
and data model. The formal semantics definition of the full language is provided for 
completeness in Appendix A.

6.1 XPath syntax and data model

XPath is a compositional language for selecting element nodes in XML documents. It
is also the dialect that most XML manipulation languages (e.g., XSLT^1, XPointer^2,
XQuery^3, etc.) have in common. In this section we introduce the language expression
grammar and its data model based on axes.

Definition 6.1 (XPath Expression Grammar) Let $e, e_1, \ldots, e_m$ be expressions,
$locpath, locpath_1, \ldots, locpath_m$ be location paths, $l$ be a node name from the
label alphabet $Label$ of the axis graph, $axis$ be a relation in $Axes$, and $op$ be a
placeholder for any of the XPath functions and operators such as $+$, $-$, $*$, $div$,
$=$, $\neq$, $\leq$, $<$, $\geq$ and $>$, as well as for the context accessing functions
$position()$ and $last()$. The following is the grammar for XPath 1.0 expressions:

$e := disj \mid op(e_1, \ldots, e_m)$

$disj := locpath_1 \mid \ldots \mid locpath_m$

$locpath := par \mid comp \mid abs \mid step$

$par := ( disj ) [e_1] \ldots [e_m]$

$comp := locpath_1 / locpath_2$

$abs := / locpath$

$step := axis :: l \, [e_1] \ldots [e_m]$

□

The XPath data model includes atomic values, sequences, and a predefined set of axes
for navigating the instance. As in an axis graph, which is an abstract representation
of the XPath data model, axes define relationships between nodes in the instance. We
provide next a definition of the XPath axes in terms of firstchild, nextsibling, their
inverses and self.

^1 http://www.w3.org/TR/xslt
^2 http://www.w3.org/TR/xptr/
^3 http://www.w3.org/TR/xquery/

Definition 6.2 (XPath Axes) Given an axis graph $A = (Inst, Axes, Label, \lambda)$, the
XPath axes in $A$ are defined as follows:

• $self := \{(v, v) \mid v \in Inst\}$

• $child := firstchild.nextsibling^*$

• $parent := (nextsibling^{-1})^*.firstchild^{-1}$

• $descendant := firstchild.(firstchild \cup nextsibling)^*$

• $ancestor := (firstchild^{-1} \cup nextsibling^{-1})^*.firstchild^{-1}$

• $descendant\text{-}or\text{-}self := descendant \cup self$

• $ancestor\text{-}or\text{-}self := ancestor \cup self$

• $following := ancestor\text{-}or\text{-}self.nextsibling.nextsibling^*.descendant\text{-}or\text{-}self$

• $preceding := ancestor\text{-}or\text{-}self.nextsibling^{-1}.(nextsibling^{-1})^*.descendant\text{-}or\text{-}self$

• $following\text{-}sibling := nextsibling.nextsibling^*$

• $preceding\text{-}sibling := (nextsibling^{-1})^*.nextsibling^{-1}$ □

Whenever it is clear from the context, we use s, c, p, d, a, ds, as, f, pc, fs and ps as
abbreviations of self, child, parent, descendant, ancestor, descendant-or-self,
ancestor-or-self, following, preceding, following-sibling and preceding-sibling, respectively.
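As an illustration of Definition 6.2, the following Python sketch derives some of the axes from the two primitive relations by relational composition and Kleene closure. The five-node instance and the helper names (`compose`, `inverse`, `star`) are assumptions made for this example only; they are not part of DescribeX.

```python
def compose(r, s):
    """Relational composition r.s = {(a, c) | (a, b) in r and (b, c) in s}."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def inverse(r):
    """Inverse relation r^-1."""
    return {(b, a) for (a, b) in r}

def star(r, nodes):
    """Reflexive-transitive closure r* over the given node set."""
    closure = {(v, v) for v in nodes}
    frontier = closure
    while frontier:
        frontier = compose(frontier, r) - closure
        closure |= frontier
    return closure

# Hypothetical instance: node 1 has children 2, 3, 4; node 3 has child 5.
nodes = {1, 2, 3, 4, 5}
firstchild = {(1, 2), (3, 5)}
nextsibling = {(2, 3), (3, 4)}

child = compose(firstchild, star(nextsibling, nodes))
parent = compose(star(inverse(nextsibling), nodes), inverse(firstchild))
descendant = compose(firstchild, star(firstchild | nextsibling, nodes))
following_sibling = compose(nextsibling, star(nextsibling, nodes))
```

On this toy instance the derived `parent` relation is exactly the inverse of `child`, as one would expect from the definitions.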

Note that the self, ancestor, descendant, preceding, and following axes from a
given node $v$ partition the nodes in the XML tree. This is represented graphically by the
schema in Figure 6.1.

Figure 6.1: Partition of the nodes in the XML tree by axis relations 

Since XML documents are ordered, we need to define a document order relation on 
the nodes of an axis graph A. 

Definition 6.3 (Document Order) The document order relation $\prec_{doc}$ on an axis graph
$A = (Inst, Axes, Label, \lambda)$ is the total order relation given by $d \cup f$, where
$d$ and $f$ are the XPath axes in $Axes$ from Definition 6.2. □



Based on the document order relation and its inverse we define next axis order and 
axis position. 

Definition 6.4 (Axis Order) Let $A = (Inst, Axes, Label, \lambda)$ be an axis graph. We
define the binary axis order relation $\prec_{axis}$ in $Inst \times Inst$ as $\prec_{doc}$
if $axis \in \{s, c, d, ds, f, fs\}$ and as $\prec_{doc}^{-1}$ otherwise. □
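Definitions 6.3 and 6.4 can be sketched on a small tree as follows. The tree, the preorder-based ranking and the function names are assumptions for illustration; document order is obtained here by a preorder traversal, which coincides with $d \cup f$ on a tree.

```python
# Hypothetical toy tree: node -> ordered list of children.
children = {1: [2, 3, 4], 2: [], 3: [5], 4: [], 5: []}

def preorder(root):
    """Nodes of the subtree rooted at root, in document order."""
    order = [root]
    for c in children[root]:
        order.extend(preorder(c))
    return order

def descendants(v):
    """All descendants of v."""
    out = set()
    for c in children[v]:
        out.add(c)
        out |= descendants(c)
    return out

doc_order = preorder(1)
rank = {v: i for i, v in enumerate(doc_order)}

def following(v):
    """Nodes after v in document order that are not descendants of v."""
    desc = descendants(v)
    return {w for w in doc_order if rank[w] > rank[v] and w not in desc}

def axis_sorted(nodes, axis):
    """Sort by axis order: document order for forward axes, its inverse otherwise."""
    forward = {"s", "c", "d", "ds", "f", "fs"}
    return sorted(nodes, key=rank.get, reverse=axis not in forward)
```

For any node `v`, the nodes that come after `v` in `doc_order` are exactly `descendants(v) | following(v)`, which is the content of Definition 6.3.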

Having introduced the XPath syntax and data model, we discuss next how descriptions
are changed in DescribeX using XPath.

6.2 Refinement with XPath 

Whenever the representative neighbourhood of an SD node $s$ is LPath distinguishable,
it is possible to precisely characterize the extent of $s$ in terms of the axis label paths in
its LPath set (see Chapter 5.1). For this class of neighbourhood, nodes with the same
LPath set are bisimilar (Proposition 5.1). Therefore, we propose a mechanism capable
of computing the extent of $s$ based on its LPath set.

First, we need a few auxiliary results that show how an axis label path in a given
LPath set can be captured by a single XPath expression. We will show later how to
derive EEs from these axis label path expressions. In order to prove our results, we use
the XPath formal semantics given in Appendix A.

Lemma 6.1 Let $e$ be an XPath expression of the form $axis_1 :: l_1 / \ldots / axis_n :: l_n$.
If $\mathcal{D}[\![e]\!](v) \neq \emptyset$ then there exists an axis label path
$lp = axis_1[l_1]. \cdots .axis_n[l_n]$ from $v$. □

Proof 6.1 If $\mathcal{D}[\![e]\!](v) \neq \emptyset$, then by semantic rule (A.1) in
Figure A.1 there must exist $v_1, \ldots, v_n$ such that
$(v, v_1) \in axis_1 \wedge (v_1, v_2) \in axis_2 \wedge \ldots \wedge (v_{n-1}, v_n) \in axis_n$,
and $\lambda(v_i) = l_i$, $1 \leq i \leq n$. This means that there is a path from $v$ to $v_n$
going through edges $axis_1, \ldots, axis_n$ and nodes $v_1, \ldots, v_n$ such that
$axis_1[l_1]. \cdots .axis_n[l_n]$ is its axis label path. □

Lemma 6.2 Let $e$ be an XPath expression of the form $axis_1 :: l_1 / \ldots / axis_n :: l_n$.
If $\mathcal{D}[\![e]\!](v) = \emptyset$ then there is no axis label path
$lp = axis_1[l_1]. \cdots .axis_n[l_n]$ from $v$. □

Proof 6.2 If $\mathcal{D}[\![e]\!](v) = \emptyset$, then by semantic rule (A.1) in
Figure A.1 there are no $w_1, \ldots, w_n$ such that
$(v, w_1) \in axis_1, (w_1, w_2) \in axis_2, \ldots, (w_{n-1}, w_n) \in axis_n$ and
$\lambda(w_i) = l_i$, $1 \leq i \leq n$. This means that there is no path from $v$ to $w_n$
going through edges $axis_1, \ldots, axis_n$ and nodes $w_1, \ldots, w_n$, and thus there is
no axis label path $lp = axis_1[l_1]. \cdots .axis_n[l_n]$ from $v$. □
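The correspondence stated by Lemmas 6.1 and 6.2 can be sketched as a small evaluator: a path expression made of `axis :: label` steps returns a non-empty set from `v` exactly when the matching axis label path exists from `v`. The axis graph below (node ids, labels, relations) is a made-up instance and `eval_path` is a hypothetical helper, not the semantics of Appendix A.

```python
def eval_path(v, steps, axes, label):
    """Nodes reached from v by following steps [(axis, label), ...] in order."""
    current = {v}
    for axis, lab in steps:
        current = {w for u in current for (x, w) in axes[axis]
                   if x == u and label[w] == lab}
    return current

# Hypothetical axis graph: a participant with an interactorRef and an expRoleList.
label = {1: "participant", 2: "interactorRef", 3: "expRoleList", 4: "expRole"}
axes = {
    "c": {(1, 2), (1, 3), (3, 4)},   # child
    "fc": {(1, 2), (3, 4)},          # first child
}

# c[expRoleList].fc[expRole] exists from node 1, so the expression is non-empty.
found = eval_path(1, [("c", "expRoleList"), ("fc", "expRole")], axes, label)
# c[expRole] does not exist from node 1, so the expression is empty.
missing = eval_path(1, [("c", "expRole")], axes, label)
```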

Consider SD node $s$ with AxPRE $\alpha$. In order to compute the extent of $s$ we need
to get all nodes that have the same LPath set and label as $s$. Therefore, we need
to write an XPath expression $e_i$ as defined in Lemma 6.1 for each different axis label
path $lp_i$ in LPath. Then, all $e_i$ expressions have to be combined in one EE as follows:
$exp = /ds :: \lambda(s)[e_1] \ldots [e_n]$. However, such an expression does not guarantee
that the returned nodes have the exact $\{lp_1, \ldots, lp_n\}$ LPath set: it only guarantees
containment. That is, $exp$ will return all nodes $v$ such that
$LPath(v) \supseteq \{lp_1, \ldots, lp_n\}$. The reason is that $exp$ says that all
$[e_1] \ldots [e_n]$ have to be satisfied, but it does not say they have to be the only ones,
which would be required for equality. The way of circumventing this problem is by explicitly
adding a $[not(e_i)]$ predicate for each $lp_i$ that is not in $LPath(v)$.
The problem with this approach is that it would require the explicit negation of a
large number of axis label paths. However, we can drastically reduce that number by
considering only SD nodes that have an AxPRE $\alpha'$ such that $\alpha' \subseteq \alpha$.
The intuition is that, if two AxPREs of SD nodes $s_1$ and $s_2$ are not in a containment
relationship, then nodes in their extents cannot have LPath sets in a containment relationship
either and we do not need to have a $not()$ predicate for them. The following example
illustrates how an EE is composed from axis label path expressions and $not()$ predicates.

Example 6.1 (Extent Expressions) Consider SD nodes $s_{41}$, $s_{42}$, and $s_{43}$ from
Figure 6.2. For the EE of $s_{41}$, we need all axis label paths that are in $LPath(s_{41})$
but not in $LPath(s_{42}) \cup LPath(s_{43})$. The required LPath sets are the following:
$LPath(s_{41}) = \{c[interactorRef]\}$,
$LPath(s_{43}) = \{c[interactorRef], c[expRoleList].fc[expRole]\}$, and
$LPath(s_{42}) = \{c[interactorRef], c[expRoleList].fc[expRole].ns[expRole]\}$. The final EE
will have a positive predicate for each string in $LPath(s_{41})$ and a negative one for
each string in $(LPath(s_{42}) \cup LPath(s_{43})) - LPath(s_{41})$. The resulting expression is
ee(s41) = /ds :: participant[c :: interactorRef][not(c :: expRoleList/c :: *[1][s ::
expRole])][not(c :: expRoleList/c :: *[1][s :: expRole]/fs :: *[1][s :: expRole])], where
c :: *[1][s :: expRole] and fs :: *[1][s :: expRole] are the XPath expressions of fc[expRole]
and ns[expRole], respectively. □
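The composition of an EE from positive and negative axis label paths, as in Example 6.1, can be sketched as a string builder. This is only a sketch under stated assumptions: axis label paths are represented as tuples of `(axis, label)` pairs, only the `c`, `fc` and `ns` axes are handled, and the output uses unabbreviated XPath 1.0 axis names rather than the shorthand used in the text. The function names are hypothetical.

```python
def step_to_xpath(axis, label):
    """Translate one axis label path step into an XPath 1.0 step."""
    if axis == "c":
        return f"child::{label}"
    if axis == "fc":   # first child, which must carry the given label
        return f"child::*[1][self::{label}]"
    if axis == "ns":   # next sibling, which must carry the given label
        return f"following-sibling::*[1][self::{label}]"
    raise ValueError(f"unsupported axis: {axis}")

def lpath_to_xpath(lpath):
    return "/".join(step_to_xpath(a, l) for a, l in lpath)

def build_ee(label, positive, negative):
    """One positive predicate per required path, one not() per excluded path."""
    pos = "".join(f"[{lpath_to_xpath(p)}]" for p in positive)
    neg = "".join(f"[not({lpath_to_xpath(p)})]" for p in negative)
    return f"/descendant-or-self::{label}{pos}{neg}"

# LPath sets of Example 6.1.
lp_s41 = {(("c", "interactorRef"),)}
lp_s43 = lp_s41 | {(("c", "expRoleList"), ("fc", "expRole"))}
lp_s42 = lp_s41 | {(("c", "expRoleList"), ("fc", "expRole"), ("ns", "expRole"))}

negative = sorted((lp_s42 | lp_s43) - lp_s41)
ee_s41 = build_ee("participant", sorted(lp_s41), negative)
```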

Note that the EEs resulting from this approach might have redundant predicates that
can be simplified. Consider Example 6.1 for instance: if a node does not exist, then
neither does a following sibling for that node, so the last predicate of $ee(s_{41})$ can be
removed safely. There are many other useful simplifications that can be applied to EEs,
but a broad theory of equivalence is beyond the scope of this thesis.

Figure 6.2: The [participant].c.fc.ns* neighbourhood from Figure 3.9

The next proposition shows that the EEs thus constructed return all nodes that do 
have the axis label paths specified in the positive predicates and do not have those in the 
negative predicates. 

Proposition 6.1 Let $A$ be an axis graph and
$e_s = ds :: l[e_1] \ldots [e_n][not(e_{n+1})] \ldots [not(e_m)]$ an XPath expression where
$e_i = axis_{i_1} :: l_{i_1} / \ldots / axis_{i_k} :: l_{i_k}$, $1 \leq i \leq m$. Then, $e_s$
returns all nodes $v$ such that there exist $lp_1, \ldots, lp_n$ axis label paths from $v$ and
there are no $lp_{n+1}, \ldots, lp_m$ axis label paths from $v$, where
$lp_i = axis_{i_1}[l_{i_1}]. \cdots .axis_{i_k}[l_{i_k}]$. □



Proof 6.3

$\mathcal{D}[\![ds :: l[e_1] \ldots [e_n][not(e_{n+1})] \ldots [not(e_m)]]\!]$

by semantic rules (A.5) and (A.9) in Figure A.2

$\{v \mid \lambda(v) = l \wedge (v_0, v) \in ds \wedge \bigwedge_{i=1}^{n} \mathcal{E}[\![e_i]\!](v, pos_{ds}(v, S), |S|) = true \wedge \bigwedge_{i=n+1}^{m} \mathcal{E}[\![not(e_i)]\!](v, pos_{ds}(v, S), |S|) = true\}$

since all nodes $v$ are reachable from $v_0$, $(v_0, v) \in ds$ is always true

$\{v \mid \lambda(v) = l \wedge \bigwedge_{i=1}^{n} \mathcal{E}[\![e_i]\!](v, pos_{ds}(v, S), |S|) = true \wedge \bigwedge_{i=n+1}^{m} \mathcal{E}[\![not(e_i)]\!](v, pos_{ds}(v, S), |S|) = true\}$

by semantic rule (A.4) in Figure A.1 with $Op = not(boolean(S))$ from Figure A.3

$\{v \mid \lambda(v) = l \wedge \bigwedge_{i=1}^{n} \mathcal{E}[\![e_i]\!](v, pos_{ds}(v, S), |S|) = true \wedge \bigwedge_{i=n+1}^{m} \mathcal{E}[\![e_i]\!](v, pos_{ds}(v, S), |S|) = false\}$

by semantic rule (A.1) in Figure A.1

$\{v \mid \lambda(v) = l \wedge \bigwedge_{i=1}^{n} \mathcal{D}[\![e_i]\!](v) = true \wedge \bigwedge_{i=n+1}^{m} \mathcal{D}[\![e_i]\!](v) = false\}$

by Lemmas 6.1 and 6.2

$\{v \mid \lambda(v) = l \wedge \exists lp_1, \ldots, \exists lp_n \wedge \nexists lp_{n+1}, \ldots, \nexists lp_m\}$

□

Figure 6.3: The [participant].c* neighbourhood of $s_4$ from Figure 3.8 (a) before a c*
refinement, (b) after the refinement




In some special cases, a more compact XPath expression can be obtained. For instance,
for an expression containing the closure of an axis, like c*, we can enforce that the
$lp_i$'s expressed by the EE are the only ones by using the count() XPath function. Since
the XPath expression of each $lp_i$ for c* contains only compositions of the child axis, the
set of nodes reached by all $lp_i$'s and all their substrings are exactly all the descendants.



Example 6.2 Consider the [participant].c* neighbourhood of nodes $s_{41}$ and $s_{42}$ in
Figure 6.3. The extents of $s_{41}$ and $s_{42}$ are {4} and {6,18,23,28}, respectively. The
LPath sets of the nodes are $LPath(s_{41}) = \{c[interactorRef]\}$ and
$LPath(s_{42}) = \{c[interactorRef], c[expRoleList].c[expRole]\}$, whereas the EEs are
e1 = ds :: participant[c :: interactorRef][count(d :: *) = count(c :: interactorRef)] and
e2 = ds :: participant[c :: interactorRef][c :: expRoleList][c :: expRoleList/c :: expRole]
[count(d :: *) = count(c :: interactorRef) + count(c :: expRoleList) +
count(c :: expRoleList/c :: expRole)]. □
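The count-based condition of Example 6.2 can be checked directly on small trees. In the sketch below, a tree node is a hypothetical `(label, children)` pair and the two participants are made-up instances shaped like the elements in the extents of $s_{41}$ and $s_{42}$; the function names are assumptions for illustration.

```python
def matches(node, path):
    """Nodes reached from node by a chain of child steps with the given labels."""
    current = [node]
    for lab in path:
        current = [c for n in current for c in n[1] if c[0] == lab]
    return current

def count_descendants(node):
    """Number of proper descendants, i.e. count(d :: *)."""
    return sum(1 + count_descendants(c) for c in node[1])

def satisfies_count_ee(node, paths):
    """All paths exist and together they account for every descendant."""
    has_all = all(len(matches(node, p)) > 0 for p in paths)
    total = sum(len(matches(node, p)) for p in paths)
    return has_all and count_descendants(node) == total

# Hypothetical participants: one with a single interactorRef, one with a role list.
small = ("participant", [("interactorRef", [])])
large = ("participant", [("interactorRef", []),
                         ("expRoleList", [("expRole", []), ("expRole", [])])])

paths_e1 = [("interactorRef",)]
paths_e2 = [("interactorRef",), ("expRoleList",), ("expRoleList", "expRole")]
```

Here `small` satisfies only the first expression and `large` only the second, so the count predicate separates the two extents without any negated predicate.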

The next proposition shows, for the special case of c*, that the EEs thus constructed 
return a set of nodes that do have the axis label paths specified in the predicates. 

Proposition 6.2 Let $A$ be an axis graph and
$e_s = ds :: l[e_1] \ldots [e_n][count(d :: *) = count(e_1) + \ldots + count(e_n)]$ an XPath
expression where $e_i = c :: l_{i_1} / \ldots / c :: l_{i_k}$, $1 \leq i \leq n$. Then, $e_s$
returns all nodes $v$ such that there exist only $lp_1, \ldots, lp_n$ axis label paths
from $v$ of the form $lp_i = c[l_{i_1}]. \cdots .c[l_{i_k}]$. □



For proving Proposition 6.2 we need the following lemmas.

Lemma 6.3 Let $e = child :: l_1 / \ldots / child :: l_m$ be an XPath expression. For every node
$v \in Inst$: $\mathcal{D}[\![e]\!](v) \subseteq \mathcal{D}[\![d :: *]\!](v)$. □

Lemma 6.4 Let $S$ and $S_1, \ldots, S_n$ be sets such that $S_i \subseteq S$ and
$S_i \neq \emptyset$ for $1 \leq i \leq n$.
Then $|S| = \sum_{i=1}^{n} |S_i| \Rightarrow S = \bigcup_{i=1}^{n} S_i$. □

Using Lemmas 6.3 and 6.4 we can prove Proposition 6.2 as follows.



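Proof 6.4

$\mathcal{D}[\![d :: *[e_1] \ldots [e_n][count(d :: *) = count(e_1) + \ldots + count(e_n)]]\!]$

by semantic rules (A.5) and (A.9) in Figure A.2

$\{v \mid (v_0, v) \in d \wedge \bigwedge_{i=1}^{n} \mathcal{E}[\![e_i]\!](v, pos_d(v, S), |S|) = true \wedge \mathcal{E}[\![count(d :: *) = count(e_1) + \ldots + count(e_n)]\!](v, pos_d(v, S), |S|) = true\}$

since all nodes $v$ are reachable from $v_0$, $(v_0, v) \in d$ is always true

$\{v \mid \bigwedge_{i=1}^{n} \mathcal{E}[\![e_i]\!](v, pos_d(v, S), |S|) = true \wedge \mathcal{E}[\![count(d :: *) = count(e_1) + \ldots + count(e_n)]\!](v, pos_d(v, S), |S|) = true\}$

by semantic rule (A.4) in Figure A.1 with Op's count, +, and = from Figure A.3

$\{v \mid \bigwedge_{i=1}^{n} \mathcal{E}[\![e_i]\!](v, pos_d(v, S), |S|) = true \wedge |\mathcal{D}[\![d :: *]\!](v)| = \sum_{i=1}^{n} |\mathcal{D}[\![e_i]\!](v)|\}$

by semantic rule (A.1) in Figure A.1

$\{v \mid \bigwedge_{i=1}^{n} \mathcal{D}[\![e_i]\!](v) \neq \emptyset \wedge |\mathcal{D}[\![d :: *]\!](v)| = \sum_{i=1}^{n} |\mathcal{D}[\![e_i]\!](v)|\}$

by Lemmas 6.3 and 6.4

$\{v \mid \bigwedge_{i=1}^{n} \mathcal{D}[\![e_i]\!](v) \neq \emptyset \wedge \mathcal{D}[\![d :: *]\!](v) = \bigcup_{i=1}^{n} \mathcal{D}[\![e_i]\!](v)\}$

by semantic rule (A.9) in Figure A.1

$\{v \mid \bigwedge_{i=1}^{n} \mathcal{D}[\![e_i]\!](v) \neq \emptyset \wedge \{w \mid (v, w) \in d\} = \bigcup_{i=1}^{n} \mathcal{D}[\![e_i]\!](v)\}$

since the $lp_i$'s are of the form $lp_i = c[l_{i_1}]. \cdots .c[l_{i_k}]$

$\{v \mid \{w \mid (v, w) \in d\} = \{w \mid p = (v, c, \ldots, c, w) \text{ is an axis path} \wedge \lambda(p) = lp_i\}\}$

□
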



6.3 Stabilization with XPath 



As we have seen in Chapter 5.2, edge stabilization can be reduced to node refinement.
However, when the EEs of the nodes in an edge are available, we can use the description
provided by the EEs and compute the stabilization directly from them. The idea is to
express the condition for forward-stability (i.e.,
$\forall x \in extent(s_i), \exists y \in extent(s_j) \wedge (x, y) \in axis$) of an edge
$(s_i, s_j)$ in XPath using $ee(s_i)$ and $ee(s_j)$.



Algorithm 6.1 computes the stabilization of a single edge by updating the EEs of the
nodes in the edge and their extents. The algorithm replaces node $s_i$ by two new nodes:
$s_i'$ and $s_i''$. The extent of $s_i'$ contains all nodes in the extent of the original $s_i$
that are in an axis relation with nodes in the extent of $s_j$ (line 2). The extent of $s_i''$
contains the complement of $s_i'$ with respect to $s_i$, i.e., it contains all nodes that do
not have such an axis relation with nodes in the extent of $s_j$ (line 3). Consequently, after
the new edge is created (line 6), $s_i'$ has a forward-stable axis edge to $s_j$ whereas
$s_i''$ does not have any axis edge to $s_j$. The EEs obtained in lines 4 and 5 are the EEs
for the new nodes.




Algorithm 6.1

stabilizeEdgeXPath($D$, $s_i$, $s_j$)

Input: An SD $D$ containing a non forward-stable edge $e = (s_i, s_j)$ with label $axis$
Output: An SD $D$ where $e$ has been replaced by forward-stable $e' = (s_i', s_j)$.

1: create new nodes $s_i'$ and $s_i''$
2: $extent(s_i') := \{x \in extent(s_i) \mid \exists y \in extent(s_j) \wedge (x, y) \in axis\}$
3: $extent(s_i'') := extent(s_i) - extent(s_i')$
4: $ee(s_i') = ee(s_i)[axis :: \lambda(s_j) \; intersect \; ee(s_j)]$
5: $ee(s_i'') = ee(s_i)[not(axis :: \lambda(s_j) \; intersect \; ee(s_j))]$
6: create an edge $e' = (s_i', s_j)$
7: let $S$ be the set of nodes connected to $s_i$
8: for every node $s$ in $S$ do
9:    add edges $(s_i', s)$ and $(s, s_i'')$ if conditions in Definition 3.16 are satisfied
10: delete node $s_i$, and all its incoming and outgoing edges
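Lines 2 to 5 of Algorithm 6.1 can be sketched as follows, under the assumption that extents are available as sets of element ids and the axis relation as a set of pairs. The function name, its parameters and the concrete child pairs are hypothetical; only the extents {4,6,18,23,28} and {8,20,25,30} come from the running example. Edge bookkeeping (lines 6 to 10) is omitted.

```python
def stabilize_edge(extent_i, extent_j, axis_pairs, ee_i, ee_j, axis_name, label_j):
    """Split extent_i by forward-stability with respect to extent_j.

    Returns ((extent', ee'), (extent'', ee'')) mirroring lines 2-5 of
    Algorithm 6.1; axis_pairs is the axis relation as a set of (x, y) pairs.
    """
    stable = {x for x in extent_i if any((x, y) in axis_pairs for y in extent_j)}
    rest = extent_i - stable
    test = f"{axis_name}::{label_j} intersect {ee_j}"
    return (stable, f"{ee_i}[{test}]"), (rest, f"{ee_i}[not({test})]")

# Extents from the running example; the child pairs are assumed for illustration.
s4, s6 = {4, 6, 18, 23, 28}, {8, 20, 25, 30}
child = {(6, 8), (18, 20), (23, 25), (28, 30)}
(s42, ee42), (s41, ee41) = stabilize_edge(
    s4, s6, child, "/ds::participant", "/ds::expRoleList", "child", "expRoleList")
```

With these inputs the split reproduces the extents {6,18,23,28} and {4} of the refined participant nodes.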



Note that we do not need additional count() nor not() predicates in the new expressions
because all the required ones are already in $ee(s_i)$ and $ee(s_j)$.



Example 6.3 Consider edge $(s_4, s_6)$ from Figure 5.5 (a), which is not forward-stable.
Edge stabilization will create two nodes, $s_{41}$ and $s_{42}$, as shown in Figure 5.5 (b).
Given that ee(s4) = /ds :: participant, ee(s6) = /ds :: expRoleList, and the stabilized edge
corresponds to a c axis, the resulting expressions are the following: ee(s42) = /ds ::
participant[child :: expRoleList intersect /ds :: expRoleList] and ee(s41) = /ds ::
participant[not(child :: expRoleList intersect /ds :: expRoleList)]. □

6.4 Adapting SDs to XPath queries

Previously in this chapter, we used XPath to express EEs and to manipulate them for 
refinement and edge stabilization operations. In this section we show how XPath queries 
are used to guide a refinement operation in the process of adapting an SD to a query. 

In order to evaluate a query using an SD, we need to find the SD nodes that participate 
in the answer. DescribeX's approach is to find the SD nodes that contain a superset of 
the answer and then evaluate the entire expression on them to get the exact answer. 

One of the central problems for finding a superset of the answer is how to decide
what SD nodes can be used to answer an XPath query. This requires some sort of XPath
matching algorithm and the ability to decide whether there exists an exact rewriting of a
query using an SD. The matching algorithm will transform the structural subquery of the
XPath expression (the expression that results from removing all non-structural predicates
such as those containing functions) to be evaluated into an equivalent AxPRE $\alpha$. Then,
we need to find the SD node (or nodes) whose AxPRE is contained in $\alpha$. The union of
the extents of such nodes is a superset of the answer. If the query is purely structural
(i.e., the query is equal to its structural subquery) and $\alpha$ is equivalent to some SD
node AxPRE, then the answer to the query is exactly the union of the extents. Otherwise, we
need to run the entire query on the union of the extents to find the exact answer.

We begin by discussing in the next section how to derive an AxPRE from an XPath 
expression. 

6.4.1 Deriving AxPREs from queries 

DescribeX can adapt an SD node to an XPath query $Q$, as we have illustrated in our
RSS feeds motivating example in Chapter 1.2. This section formalizes how an AxPRE is
obtained from $Q$ by using the two derivation functions $L$ and $P$ we provide in Figure 6.4.
We begin by illustrating the XPath-to-AxPRE derivation with a concrete example.

$P(Op(e_1, \ldots, e_m)) := \epsilon$ (6.1)

$P(axis :: l[e_1] \ldots [e_m]/rlocpath) := Ax(axis).(P(e_1)| \ldots |P(e_m)|P(rlocpath))$ (6.2)

$P((locpath)[e_1] \ldots [e_m]/rlocpath) := P(locpath).(P(e_1)| \ldots |P(e_m)|P(rlocpath))$ (6.3)

$P(locpath_1| \ldots |locpath_m) := (P(locpath_1)| \ldots |P(locpath_m))$ (6.4)

$L(rlocpath / axis :: l[e_1] \ldots [e_m]) := Ax(axis^{-1}).(L(rlocpath))|P(e_1)| \ldots |P(e_m)$ (6.5)

$L(rlocpath / (locpath)[e_1] \ldots [e_m]) := L(locpath).(L(rlocpath))|P(e_1)| \ldots |P(e_m)$ (6.6)

$L(locpath_1| \ldots |locpath_m) := (L(locpath_1)| \ldots |L(locpath_m))$ (6.7)

Figure 6.4: AxPRE derivation functions L and P

Example 6.4 Consider the following query

Q3 = /ds::participant[c::expRoleList/fc::expRole/ns::expRole]
     [not(ds::expRole/names="prey")]

Q3 returns all participants that have expRoleLists whose first two children are expRole
elements and that are not playing the "prey" role in the experiments. Note that the
structural subquery appears in black (the last predicate in grey is not part of the structural
subquery).



The first rule of Figure 6.4 that applies is (6.5) with the following variables:
$rlocpath = \emptyset$, $axis = ds$, $l = participant$,
$e_1 = c :: expRoleList/fc :: expRole/ns :: expRole$, and
$e_2 = not(ds :: expRole/names = "prey")$, resulting in

$Ax(ds^{-1})|P(e_1)|P(e_2)$

where $Ax$ is a function that translates the XPath axis into its AxPRE axis counterpart.
In particular, $Ax(axis^{-1})$ returns the actual AxPRE inverse (e.g., $child^{-1}$ is converted
into $p$) and recursive axes are translated to an equivalent Kleene closure of non-recursive
axes (e.g., descendant translates into $c^*$).

The expansion of $P(e_2)$ is very simple. The predicate is basically a function, so
it matches rule (6.1) and the result of $P(not(ds :: expRole/names = "prey"))$ is $\epsilon$
(remember that this predicate is not part of the structural subquery). This results in the
following intermediate expression

$as|P(e_1)|\epsilon$

For expanding $P(e_1)$, the first rule invoked is (6.2) with $axis = c$, $l = expRoleList$,
$rlocpath = fc :: expRole/ns :: expRole$ and empty predicates. The intermediate expression
is now

$as|Ax(c).(P(fc :: expRole/ns :: expRole))$

For expanding $P(fc :: expRole/ns :: expRole)$, the rule that applies is again (6.2), with
$axis = fc$, $l = expRole$, $rlocpath = ns :: expRole$ and no predicates, which results in

$as|c.Ax(fc).(P(ns :: expRole))$

Similarly, we can expand $P(ns :: expRole)$ and obtain

$as|c.fc.ns$

Finally, the node test of the step corresponding to the answer (participant in this
case) is prefixed as a label predicate to the AxPRE. Therefore, the resulting AxPRE of
query Q3 is

$\alpha_{Q3} = [participant].(as|c.fc.ns)$
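The expansion of $P$ in Example 6.4 can be sketched as a small recursive function covering rules (6.1) and (6.2) only. The input representation (a location path as a list of `(axis, label, predicates)` steps), the `FUNC` sentinel, the partial `AX` table and the choice to drop empty ($\epsilon$) branches are all assumptions made for this sketch; the `as` branch produced by rule (6.5) is written by hand.

```python
FUNC = object()  # stands for any op(e_1, ..., e_m) predicate, rule (6.1)

# Partial, assumed Ax table: XPath axis -> AxPRE axis.
AX = {"c": "c", "fc": "fc", "ns": "ns", "d": "c*"}

def P(expr):
    """Derive an AxPRE string from a location path given as a list of
    (axis, label, predicates) steps; '' plays the role of epsilon."""
    if expr is FUNC or not expr:
        return ""
    (axis, _label, preds), rest = expr[0], expr[1:]
    # Rule (6.2): Ax(axis).(P(e_1)|...|P(e_m)|P(rlocpath)), skipping epsilons.
    branches = [b for b in [P(e) for e in preds] + [P(rest)] if b]
    if not branches:
        return AX[axis]
    if len(branches) == 1:
        return f"{AX[axis]}.{branches[0]}"
    return f"{AX[axis]}.({'|'.join(branches)})"

# First predicate of Q3: c::expRoleList/fc::expRole/ns::expRole
e1 = [("c", "expRoleList", []), ("fc", "expRole", []), ("ns", "expRole", [])]
alpha_q3 = f"[participant].(as|{P(e1)})"
```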

Once the query AxPRE $\alpha$ of a given XPath query $Q$ is computed, the next step in
adapting the SD to $Q$ is finding the SD node (or nodes) whose AxPRE $\alpha'$ is contained
in $\alpha$. Since the problem of AxPRE containment is related to that of regular expression
containment, any regular expression containment algorithm can be used here. After
finding the node, DescribeX proceeds to change $\alpha'$ to $\alpha$, which in fact modifies
the description of the node and thus the neighbourhood it summarizes. This entails performing
a refinement of the extent of the node.

6.4.2 Finding candidates

If an extent contains a superset of the answer of a query, then we say that the elements
in such an extent are candidate elements. Note that, by adapting the SD to the structural
subquery, DescribeX has found a restricted superset of the answer and hence has
considerably reduced the search space for computing the entire query.

The DescribeX architecture is tailored to process XML collections one file-at-a-time, 
the prevalent data processing model for the Web. Each file is parsed and processed 
independently of the other files in the collection. In this context, after adapting the 
SD to a given query Q, DescribeX can restrict the evaluation of Q to those documents 
(called candidate documents) that are guaranteed to provide a non-empty answer for the 
structural subquery of Q. Those candidate documents that do contain an answer for the 
entire query are called answer documents. 

Once DescribeX has computed the query AxPRE $\alpha$ of a given XPath query $Q$ as
described above, it needs to find the SD node whose AxPRE contains $\alpha$ in order to get
the candidate documents for evaluating $Q$. If there is an SD node $s$ with AxPRE $\alpha$,
then all documents in the extent of $s$ are in fact candidate documents. In contrast, if $s$
has an AxPRE $\alpha'$ containing $\alpha$, DescribeX has two alternatives. One, it can adapt
the SD by refining $s$ from $\alpha'$ to $\alpha$ and then get the candidate documents as in
the previous case. Two, it can get all documents in the extent of $s$ and run the structural
subquery of $Q$ on them in order to get the candidates. Once the candidate documents are
found, finding the answer documents entails running $Q$ on all candidates.
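The two-stage filtering described above (candidate documents from the structural subquery, answer documents from the entire query) can be sketched with Python's standard `xml.etree.ElementTree`. The three documents are made up for illustration, and the structural test for Q3 is written in plain Python because ElementTree's XPath subset cannot express it; none of this reflects the actual DescribeX implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-document collection (not from the thesis data set).
DOCS = {
    "d1": "<entry><participant><interactorRef/></participant></entry>",
    "d2": "<entry><participant><interactorRef/><expRoleList>"
          "<expRole><names>prey</names></expRole>"
          "<expRole><names>bait</names></expRole>"
          "</expRoleList></participant></entry>",
    "d3": "<entry><participant><interactorRef/><expRoleList>"
          "<expRole><names>bait</names></expRole>"
          "<expRole><names>bait</names></expRole>"
          "</expRoleList></participant></entry>",
}

def structural_match(participant):
    """Structural subquery of Q3: an expRoleList whose first two children are expRole."""
    for role_list in participant.findall("expRoleList"):
        kids = list(role_list)
        if len(kids) >= 2 and kids[0].tag == "expRole" and kids[1].tag == "expRole":
            return True
    return False

def full_match(participant):
    """Entire query: structural part plus the non-structural 'not prey' predicate."""
    names = [n.text for n in participant.findall(".//expRole/names")]
    return structural_match(participant) and "prey" not in names

def documents_where(test, doc_ids):
    """Ids of the documents containing at least one participant passing the test."""
    selected = set()
    for doc_id in doc_ids:
        root = ET.fromstring(DOCS[doc_id])
        if any(test(p) for p in root.iter("participant")):
            selected.add(doc_id)
    return selected

candidates = documents_where(structural_match, DOCS)
answers = documents_where(full_match, candidates)
```

The full query is only evaluated on the candidate documents, which is the point of computing candidates first.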

Example 6.5 Consider query Q3 for our running example. We could evaluate Q3 using
the label SD from Figure 3.8. In that case, the only node whose AxPRE is contained
in $\alpha_{Q3}$ is $s_4$ with AxPRE [participant]. For simplicity, let us assume that every
node in the extent of $s_4$ belongs to a different document, so there will be as many elements
as documents, both candidates and answers. From the SD graph we know that not all
participants in the extent of $s_4$ contain an expRoleList element because the edge
$(s_4, s_6)$ is not forward-stable. We therefore conclude that $s_4$ contains only a superset
of the answer, so we get the six documents from the extent and evaluate the query on all of
them in order to get the answer.

Alternatively, we could use the refined SD from Figure 3.9. This SD could have been
obtained from the label SD after adapting it to Q3. Regardless of how the SD was created,
we found that three nodes have AxPREs contained in $\alpha_{Q3}$: $s_{41}$, $s_{42}$, and
$s_{43}$. However, we notice from the SD graph that only node $s_{42}$ has a forward-stable
neighbourhood for $\alpha_{Q3}$. (Note that it is the only [participant] node with an edge c,
followed by a fc and an ns, all forward-stable.) That means that both nodes (and thus
documents) in the extent of $s_{42}$ are candidates, and thus we need to run Q3 only on those
two documents. If Q3 did not have the second predicate (in grey), the extent of $s_{42}$ would
be the exact answer of Q3. □

The process of exploring candidates is not unidirectional: a developer can move back
and forth between the query explanations described here and the structural exploration
described in Chapter 1.2. For instance, she may create an SD and run a query on some
candidate documents. Next, she might decide to relax the query in order to further
investigate its impact on the collection. Then, she may want to get a more or less refined
description of the collection by changing the SD using AxPRE refinements, and then start
the process again. DescribeX provides the developer with this interactive functionality
for describing and evaluating XPath queries on large XML collections.



Chapter 7 



DescribeX engine 



In previous chapters we introduced DescribeX, a powerful framework capable of
declaratively describing complex structural summaries of XML collections that captures and
generalizes many proposals in the literature. We also showed how summary descriptors 
(SDs) are created and refined to selectively produce more or less detailed descriptions of 
the data. In this chapter, we discuss how the DescribeX framework is implemented in 
the summarization engine and present two strategies for refining an SD: one is based on 
materializing the SD partitions, the other is a virtual approach that relies on constructing 
XPath expressions that compute extents. 

The DescribeX architecture is tailored to process XML collections one file at a time, 
the prevalent data processing model for the Web. Each file is parsed, processed and 
stored before continuing with the next file in the collection. Such an approach supports 
the interactive creation and refinement of AxPRE SDs for large collections of XML 
documents. 

The DescribeX engine is implemented in Java using Berkeley DB Java Edition^4 to
store and manage indexed collections (tables). The implementation can invoke an
arbitrary JAXP 1.3^5 XPath processor for the evaluation of XPath expressions. JAXP is
an implementation-independent portable API for processing XML with Java. For the
experiments reported later in this paper, the Saxon^6 XPath processor was employed. The
Saxon implementation conforms to the XPath 1.0 standard set by the W3C [W3C99] and
therefore satisfies the semantic characterization formalized in Appendix A.

^4 http://www.oracle.com/technology/products/berkeley-db/je/
^5 http://jaxp.dev.java.net/1.3/

The DescribeX implementation stores the extents in an indexed table named elemDB
that has schema elemDB(SID, docID, endPos, startPos, SID2), where the underlined
attributes are the key (also used for indexing). The elemDB table contains a tuple for
each XML element in the collection. Each SD node is identified by a unique id called
SID. Each element belongs to the extent of a unique SD node, whose SID is stored in
the SID attribute. The attribute docID holds the identifier of the document in which the
element appears. The startPos and endPos are the positions, in the document, where
the element starts and ends, respectively. SID2 allows us to maintain an SID for a second
SD.

Alternatively, the user can decide to keep the extents virtual and thus make the 
DescribeX engine store a docDB table instead of the elemDB table described above. The 
schema of the docDB table is docDB(SID, docID), which contains for each sid s the docIDs 
of all XML documents containing elements in the extent of s. This can be used to 
efficiently locate the XML documents to be evaluated by the EE of s in order to get the 
extent of s. The EEs are stored in a separate XML file. 
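The relationship between the two tables can be sketched with in-memory Python structures. This is a stand-in for illustration only: the engine itself uses Berkeley DB tables, and the tuple values, the `ElemTuple` type and the `build_doc_db` helper below are hypothetical.

```python
from collections import namedtuple, defaultdict

# Field order follows the elemDB schema (SID, docID, endPos, startPos, SID2).
ElemTuple = namedtuple("ElemTuple", "sid doc_id end_pos start_pos sid2")

# Hypothetical materialized extents: one tuple per XML element.
elem_db = [
    ElemTuple("s4", "doc1", 40, 10, None),
    ElemTuple("s4", "doc2", 55, 12, None),
    ElemTuple("s5", "doc1", 20, 11, None),
]

def build_doc_db(tuples):
    """Project elemDB onto (SID, docID): the documents that hold each extent."""
    doc_db = defaultdict(set)
    for t in tuples:
        doc_db[t.sid].add(t.doc_id)
    return dict(doc_db)

doc_db = build_doc_db(elem_db)
```

With virtual extents only a mapping like `doc_db` would be kept, and the EE of an SD node would be evaluated on the listed documents to recover the extent.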

A third scenario in which both elemDB and docDB tables coexist is also possible. In 
such a case, some SIDs would be kept in the elemDB table (with their extents materialized) 
and some others would be stored without extents in the docDB table. In this thesis we
have not studied the trade-offs emerging from this scenario.

The DescribeX engine keeps the SD graph in main memory in separate hash tables
for each axis relation in the SD, e.g., the parentsMap and childrenMap maps contain the
edge definitions for the p and c SD axes respectively. In other words, each binary axis
relation is stored as a map between a key SID $s$ and a set of SIDs $s_1, \ldots, s_n$ such that
$(s, s_i) \in axis$, $1 \leq i \leq n$. In addition, there is a label map, labelMap, that
contains the label of each SD node.

^6 http://saxon.sourceforge.net/



7.1 Initial SD construction 

Some SDs can be constructed in one pass over the collection. This is possible when
the parsing information collected at either the start tag or the end tag of an element $v$
is enough to construct the AxPRE neighbourhood $\mathcal{N}_\alpha(v)$ of the element, compute
the AxPRE partition and thus decide to what extent $v$ belongs. For instance, the start tag
itself is enough to classify an element $v$ when constructing the $\epsilon$ SD
($\mathcal{N}_\epsilon(v)$ contains just node $v$). For the $p^k$ and $p^*$ SDs, it suffices to
keep the sequence of the last $k$ open elements (for $p^k$) or all of them (for $p^*$) for
creating $\mathcal{N}_{p^k}(v)$ and $\mathcal{N}_{p^*}(v)$. Thus, $p^k$ and $p^*$ SDs can
also be constructed in one pass over the collection.



Algorithm 7.1 (buildP(k)) illustrates the use of the DescribeX data structures. The
algorithm computes the $\epsilon$, $p^k$ and $p^*$ SDs. The parameter $k$ encodes the SD as
follows: $k = 0$ corresponds to $\epsilon$, $k = maxint$ to $p^*$, and all other values represent
$p^k$. For each XML document in the collection, the algorithm parses the document and creates
an XOM^7 tree (a lightweight XML object model). The algorithm uses the XOM tree created for
composing the elemDB tuple of each element in the document containing SID, docID, and
its beginning and end offset position. Both the XOM tree and the SD are constructed
simultaneously during parsing time.

Once an SD has been constructed from scratch, the user can refine any SD node or set
of nodes by changing the node's AxPRE, as described in Chapter 5. In the next section
we provide algorithms for computing such refinements.

^7 http://www.xom.nu/

Algorithm 7.1

buildP(k)

Input: Collection $C$ of XML documents
Output: $p^k$ SD

1: for each XML document doc in collection $C$ do
2:    assign a new docID $d$ to doc
3:    create a new XOM tree $t$
4:    while parsing doc do
5:       if element start tag is found in doc then
6:          create a new $e$ in $t$ with XML attributes sid, startPos, and endPos set to empty
7:          if the $p^k$ neighbourhood of $e$ is not in the SD graph then
8:             create a new SID $s'$
9:             update labelMap, parentsMap, and childrenMap
10:            store the $p^k$ XPath expression of $s'$ in the EE XML file
11:         get the sid $s$ of $e$ from the SD
12:         set e.sid to $s$ and e.startPos to the offset position of the start tag of the element
13:      if element end tag is found in doc then
14:         set e.endPos to the offset position of the end tag of the element
15:         append tuple (e.sid, e.endPos, e.startPos) to elemDB
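The one-pass construction of Algorithm 7.1 can be sketched in Python with the standard `xml.etree.ElementTree` parser instead of XOM. Several details are assumptions of this sketch rather than the engine's behaviour: `k=None` stands for $p^*$, start and end positions are logical counters (ElementTree does not expose byte offsets), SIDs are consecutive integers, the docID is included in each tuple, and EE storage (line 10) is omitted.

```python
import xml.etree.ElementTree as ET

def build_p(docs, k):
    """Build a p^k SD (k=0: epsilon/label SD, k=None: p*) over {doc_id: xml_text}."""
    label_map, parents_map, children_map = {}, {}, {}
    sid_of, elem_db = {}, []   # neighbourhood key -> SID; (sid, doc, end, start)

    def visit(elem, path, parent_sid, doc_id, clock):
        full = path + (elem.tag,)
        key = full if k is None else full[-(k + 1):]   # label plus last k ancestors
        if key not in sid_of:
            sid_of[key] = len(sid_of) + 1
            label_map[sid_of[key]] = elem.tag
            parents_map[sid_of[key]] = set()
            children_map[sid_of[key]] = set()
        sid = sid_of[key]
        if parent_sid is not None:
            parents_map[sid].add(parent_sid)
            children_map[parent_sid].add(sid)
        start = clock[0]
        clock[0] += 1
        for child in elem:
            visit(child, full, sid, doc_id, clock)
        end = clock[0]
        clock[0] += 1
        elem_db.append((sid, doc_id, end, start))

    for doc_id, xml_text in docs.items():
        visit(ET.fromstring(xml_text), (), None, doc_id, [0])
    return sid_of, label_map, children_map, elem_db

docs = {"d1": "<a><b><c/></b><c/></a>"}   # hypothetical one-document collection
sids0, labels0, children0, elems0 = build_p(docs, 0)
sids1, _, _, _ = build_p(docs, 1)
```

For this document the label SD (`k=0`) has three nodes, while `k=1` separates the two `c` elements because their parents differ.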
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7.2 Computing refinements 

Following the materialized extents approach, a refinement can be evaluated with
Algorithm 7.2 (refineMaterialized), whereas virtual extents can be refined by Algorithm 7.3
(refineVirtual). Both algorithms are invoked with sid $s$ to be refined, its current EE $e_s$
and a family $r_1 \ldots r_n$ of refining EEs, constructed as described in Chapter 6.3.

Suppose that SD node $s_i$ with EE $r_i$ is one of the refinements of SD node $s$ with
EE $e_s$. The extent of $s_i$ is computed by evaluating $r_i$ on the set of documents that
contain elements in the extent of $s$, which entails evaluating the expression $e_s/r_i$ (line
6 of Algorithms 7.3 (refineVirtual) and 7.2 (refineMaterialized)). This set of documents
is obtained from elemDB (if the extent of $s$ is materialized) or from docDB (if the extent
of $s$ is virtual). Once we have the extent of $s_i$, the edges in the SD graph can be
constructed either from the EE when the extent is virtual (by computeEdgeByXPath,
line 10 of Algorithm refineVirtual) or from elemDB when the extent is materialized (by
computeEdgeByMerge, line 13 of Algorithm refineMaterialized).

In order to update the edges, we need to check whether there is an axis edge between 
Si and a set of candidate SD nodes Ci, . . . , c„ such that (s, Cj) G axis. This is performed 



by Algorithm 7.4 (computeEdgeByXPath) by computing the expression Cg^/axis :: *nec , 
where Cc- is the EE of candidate Cj (line 4). If the evaluation of the expression is not 
empty, then there exists an edge from Sj to Cj, otherwise there is no edge (lines 5 and 6). 

Algorithm computeEdgeByMerge (not shown), in contrast, simply computes a merge 
of the ElemDB using the startPos and endPos attributes to check for containment (in 
case of /c, c, p, a, and d axes) or precedence (for ns, fs, f, and p axes). 
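
Since computeEdgeByMerge is not shown, the following Python sketch only illustrates the interval tests that such a merge would rest on. It is an assumption-laden illustration, not the thesis algorithm: it uses a plain nested loop rather than a sorted merge, it covers only the d, a and f axes (the child, parent and sibling axes would additionally need level or parent information), and the names Region, contains, precedes and has_axis_edge are hypothetical.

```python
from collections import namedtuple

# One elemDB-style tuple: SD node id, document id, start and end positions.
Region = namedtuple("Region", "sid doc_id start end")


def contains(a, b):
    """True if element a properly contains element b (a is an ancestor of b)."""
    return a.doc_id == b.doc_id and a.start < b.start and b.end < a.end


def precedes(a, b):
    """True if element a ends before element b starts (b follows a)."""
    return a.doc_id == b.doc_id and a.end < b.start


def has_axis_edge(source, candidates, axis):
    """Check whether some element in `source` reaches some element in
    `candidates` along `axis` ('d' descendant, 'a' ancestor, 'f' following)."""
    tests = {
        "d": contains,
        "a": lambda a, b: contains(b, a),
        "f": precedes,
    }
    test = tests[axis]
    return any(test(a, b) for a in source for b in candidates)


# Hypothetical regions for <channel><item/><item/></channel> in document 0.
channel = Region(sid=1, doc_id=0, start=2, end=7)
item1 = Region(sid=2, doc_id=0, start=3, end=4)
item2 = Region(sid=2, doc_id=0, start=5, end=6)
```

With these regions, has_axis_edge([channel], [item1, item2], "d") holds because both item regions lie strictly inside the channel region, while the same call with axis "a" does not.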
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Algorithm 7.2 

refineMaterialized(sd, s, r_1, ..., r_n)

Input: sd is the SD, s is the sid to be refined, r_1, ..., r_n is a family of refining
       XPath EEs
Output: Updated sd

 1: get the XPath EE e_s of s
 2: for each input r_i do
 3:   create a new sid s_i
 4:   for each d s.t. there is a tuple t_d in elemDB with t_d.SID = s and
      t_d.docID = d do
 5:     create a XOM tree t of d in which each element has an endPos attribute with
        the offset position of the end tag of the element
 6:     assign to extent the answer of e_s/r_i
 7:     for each element u_j in extent do
 8:       locate the tuple t_j in the elemDB table corresponding to u_j by using
          (s, d, u_j.endPos) as a key
 9:       assign s_i to tuple t_j by setting t_j.SID = s_i
10:   update labelMap by assigning the label of s to the new s_i
11:   store the r_i EE of s_i in the EE XML file
12:   for each axis in the SD do
13:     call computeEdgeByMerge(sd, axis, s_i, extent, s) to test the existence of an
        axis edge from s_i
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Algorithm 7.3 

refineVirtual(sd, s, r_1, ..., r_n, extent)

Input: sd is the SD, s is the sid to be refined, r_1, ..., r_n is a family of refining
       XPath EEs
Output: Updated sd, extent with the elements in the extent of s_i

 1: get the XPath EE e_s of s
 2: for each input r_i do
 3:   create a new sid s_i
 4:   for each d s.t. there is a tuple t_d in docDB with t_d.SID = s and
      t_d.docID = d do
 5:     create a XOM tree t of d in which each element has an endPos attribute with
        the offset position of the end tag of the element
 6:     assign to extent the answer of e_s/r_i
 7:   update labelMap by assigning the label of s to the new s_i
 8:   store the r_i XPath expression of s_i in the EE XML file
 9:   for each axis in sd do
10:     call computeEdgeByXPath(sd, axis, s_i, extent, s) to test the existence of an
        axis edge from s_i
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Algorithm 7.4 

computeEdgeByXPath(sd, axis, s_i, extent, s)

Input: sd is the SD, axis is the axis edge to be computed, s_i is the new sid, extent
       is the extent of s_i, and s is the sid being refined.
Output: Updated sd

 1: assign to candidates the set of sids {c_1, ..., c_n} mapped to s in axisMap
 2: for each c_j in candidates do
 3:   get the EE e_{c_j} of c_j from the EE XML file
 4:   evaluate the intersection expression e = axis::* ∩ e_{c_j} from extent
 5:   if the evaluation of e is not empty then
 6:     add an axis edge between s_i and c_j to the corresponding axisMap
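
The intersection test in line 4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the DescribeX code: it works on a single document with xml.etree.ElementTree (whose XPath support is limited, so the intersection is computed on element sets instead of inside one XPath expression), EEs are written relative to the document root, only the c, d and p axes are covered, and axis_step and has_edge_by_xpath are hypothetical names.

```python
import xml.etree.ElementTree as ET


def axis_step(elem, axis, parent_of):
    """Return the elements reached from `elem` along `axis`."""
    if axis == "c":
        return list(elem)
    if axis == "d":
        return [e for e in elem.iter() if e is not elem]
    if axis == "p":
        parent = parent_of.get(elem)
        return [parent] if parent is not None else []
    raise ValueError("axis not covered by this sketch: " + axis)


def has_edge_by_xpath(root, extent, axis, candidate_ee):
    """True if axis::* evaluated from `extent` intersects the extent of the
    candidate EE (a path relative to `root`)."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    candidate_extent = set(root.findall(candidate_ee))
    return any(
        target in candidate_extent
        for elem in extent
        for target in axis_step(elem, axis, parent_of)
    )


DOC = (
    "<rss><channel>"
    "<item><body><p/><img/></body></item>"
    "<item><title/></item>"
    "</channel></rss>"
)
root = ET.fromstring(DOC)
items = root.findall("./channel/item")
```

For example, has_edge_by_xpath(root, items, "c", "./channel/item/body") reports a c edge from the item extent to the body node, while restricting the extent to the second item (which has no body child) reports none.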

In this chapter, we presented an implementation of the DescribeX framework that
supports the interactive creation and refinement/stabilization of AxPRE SDs for XML
collections. We introduced two strategies for locally updating an SD: one based on
materializing the SD partitions (extents), the other relying on a novel virtual approach
based on XPath expressions. The next chapter presents experimental results that
demonstrate the scalability of our strategies, even on multi-gigabyte web collections.



Chapter 8 



Experimental results 



We present here the results of an extensive empirical study we conducted using the 
DescribeX framework introduced in this thesis. 

The first part of our study evaluates the performance of the initial SD construction and
the feasibility of the different approaches (materialized, virtual, edges, etc.) to
DescribeX's main exploration operations: refinement and stabilization. The objective
here is twofold: first, to understand how key parameters (e.g., extent size, number of
documents involved, and number of SD nodes and edges affected) affect each operation;
second, to determine which method performs better under which conditions.

The goal of the second part of our experimental evaluation is to study the impact
of various summaries on XPath query processing performance. This part also provides
a comparison with variations of incoming and outgoing path summaries capturing
existing proposals like 1-index, APEX, A(k)-index, D(k)-index, and F+B-Index. We want
to emphasize that query evaluation times on collections the size of Wikipedia are rarely
reported in the literature. In fact, XML DB systems (and not just research prototypes)
are challenged when working with collections at this scale. The experiments
demonstrate that DescribeX scales up to gigabyte-sized XML collections with
significant performance gains.

Table 8.1: Test collections

                                     #Nodes            Load Time (s)
Collection   Size (MB)    #Docs   p* SD   label SD    p* SD   label SD
RSS2               210     9600    1058        301     64.2       41.4
PSIMI2             234      156     199         54     93.1       81.7
Wiki5              545    30000   15602        259    438.6      175.7
Wiki45            4520   659388   66073       1245   8089.1     6201.2



8.1 Initial SD construction 



Our experiments were conducted over four collections of documents. Table 8.1
summarizes the size and number of documents in each collection, as well as the number
of nodes and the load times for the p* and label SDs. Load times include computing the
SD graph and the partitions, and storing the extents in the ElemDB table.

For measuring times, we conducted five separate runs, starting with a cold Java Virtual
Machine (JVM) for each query. The best and worst times were ignored and the reported
runtime is the average of the remaining three times. The experiments were carried out
on a Windows XP machine with a 2.4 GHz Intel Core 2 Quad processor, and the JVM
was allocated 1 GB of RAM.
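
The measurement rule described above (five runs, drop the best and the worst, average the remaining three) amounts to a trimmed mean. A minimal Python sketch follows; the function name reported_runtime and the sample timings are hypothetical and are not measurements from the thesis.

```python
def reported_runtime(times):
    """Average of five run times after discarding the best and the worst."""
    if len(times) != 5:
        raise ValueError("expected exactly five run times")
    ordered = sorted(times)
    kept = ordered[1:-1]          # drop the fastest and the slowest run
    return sum(kept) / len(kept)


# Hypothetical timings in seconds for one query.
sample_runs = [7.9, 7.4, 9.0, 7.7, 7.5]
```

Discarding the extremes makes the reported figure less sensitive to a single slow run, for instance one affected by disk caching or JVM warm-up.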

The selected collections have different characteristics, namely total size, size and
number of individual documents, and document heterogeneity. The first collection (RSS2)
was obtained by collecting RSS feeds from thousands of different sites. The second
collection (PSIMI2) is a fragment of the IntAct PSI-MI dataset.¹ The third and fourth
collections (Wiki5 and Wiki45, respectively) were created from the Wikipedia XML
Corpus provided in INEX 2006 [DG06]. PSIMI2 is a very small collection in terms of
number of documents (only 156 in total) but a medium-sized collection with respect to
total size (about 234 MB). In contrast, Wiki5 is about twice the size of PSIMI2 but has
almost 200 times the number of documents. Consequently, the average document size
ranges from 1.5 MB in PSIMI2 to 18 KB in Wiki5. Documents in RSS2 are similar in size
to those in Wiki5. The largest collection (Wiki45, with 4.5 GB spanning 660 thousand
files) is also the one with the smallest average document size (only 6.8 KB).

¹ http://psidev.sourceforge.net/mi/xml/doc/user/
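
The average document sizes quoted above follow directly from the sizes and document counts in Table 8.1. The short Python check below reproduces them, assuming decimal units (1 MB = 1000 KB); the names COLLECTIONS and avg_doc_kb are hypothetical.

```python
# (size in MB, number of documents), copied from Table 8.1.
COLLECTIONS = {
    "RSS2": (210, 9600),
    "PSIMI2": (234, 156),
    "Wiki5": (545, 30000),
    "Wiki45": (4520, 659388),
}


def avg_doc_kb(name):
    """Average document size in KB, assuming 1 MB = 1000 KB."""
    size_mb, docs = COLLECTIONS[name]
    return size_mb * 1000 / docs


def doc_ratio(a, b):
    """How many times more documents collection a has than collection b."""
    return COLLECTIONS[a][1] / COLLECTIONS[b][1]
```

Under this assumption PSIMI2 averages 1500 KB per document, Wiki5 and RSS2 fall around 18 KB and 22 KB respectively, and Wiki45 stays just under 7 KB; Wiki5 has about 192 times as many documents as PSIMI2.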

The number of nodes in both p* and label SDs provides a measure of heterogeneity
and structural complexity. PSIMI2 is the most homogeneous of our collections, with only
54 different element names and 199 different label paths from the root. In contrast, the
most heterogeneous one is Wiki45, with over one thousand different labels and over 66
thousand different label paths from the root.

8.2 Refinements 

We tested the performance of two types of SD updates: refinements and stabilization. In 
this section we discuss the results for refinements and we provide the stabilization results 
in the next one. 



Tables 8.2, 8.3 and 8.4 show the SIDs and EEs of the selected p* SD nodes in our
test collections. These are the nodes we use for refinements and edge stabilization in
the experiments reported below. For instance, r468 corresponds to the p* SD node that
has /rss/channel/image as its EE in the RSS2 collection. Our benchmark refinements
were selected with scalability in mind: the smallest and largest extents and numbers of
documents involved are three orders of magnitude apart, ranging from 4 documents in
the p193 refinement (Table 8.6) to 6509 documents in the r449 refinement (Table 8.5).

We evaluated two different types of refinements, one given by a generic AxPRE (p*|c*)
and the other defined by a very specific one. Tables 8.5 through 8.8 report p*|c*
refinement times for the selected SD nodes. We chose the p*|c* refinement to show the
performance with AxPREs involving common axes used throughout the summary literature.
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Table 8.2: Selected p* SD nodes and EEs from RSS2 



Node   Extent Expression (EE)
r468   /rss/channel/image
r449   /rss/channel/item
r653   /rss/channel/item/body
r452   /rss/channel/item/description



Table 8.3: Selected p* SD nodes and EEs from PSIMI2

Node   Extent Expression (EE)
p59    /entrySet/entry/interactorList/interactor/organism
p18    /entrySet/entry/experimentList/experimentDescription/bibref/xref
p24    /entrySet/entry/experimentList/experimentDescription/hostOrganismList/hostOrganism
p193   /entrySet/entry/interactorList/interactor/organism/cellType



Table 8.4: Selected p* SD nodes and EEs from Wiki5 and Wiki45

Node   Extent Expression (EE)
w372   /article/body/section/section/section/figure
w199   /article/body/section/p/sub
w333   /article/body/section/section/section/section
w967   /article/body/template/template/wikipedialink

Table 8.5: RSS2 p*|c* refinements

       p* SD extent size    p*|c* refinement
                                           Times (s)
Node    #Docs   #Elems     #EEs         V        M        P          X
r468     3296     3296        7     219.2    101.8    100.1      185.7
r449     6509    90583      201   14786.2    598.2    575.2    14235.2
r653       18      320       42      51.5      4.5      3.7      145.2
r452     6253    82022        3     358.4    189.6    185.1      332.7





Table 8.6: PSIMI2 p*|c* refinements

       p* SD extent size    p*|c* refinement
                                         Times (s)
Node    #Docs   #Elems     #EEs       V        M        P        X
p59       156    24256        3    42.8     29.8     28.2     41.1
p18       156     2072        2    32.1     26.1     23.7     29.7
p24       156     2072        8   732.4    157.6    149.8    603.4
p193        4       28        1     3.9      3.5      2.0      2.8





Table 8.7: Wiki5 p*|c* refinements

       p* SD extent size    p*|c* refinement
                                          Times (s)
Node    #Docs   #Elems     #EEs        V        M        P        X
w372      252      522       16    295.8     26.3     25.7     10.1
w199      463     2194        4    448.4     33.8     29.9      3.4
w333      128      500       61   2138.9     87.8     79.1    308.2
w967      155      241        6    235.9     12.9     10.4      2.3

Table 8.8: Wiki45 p*|c* refinements

       p* SD extent size    p*|c* refinement
                                           Times (s)
Node    #Docs   #Elems     #EEs         V        M        P         X
w372      898     2166       37    1449.1    455.2    446.1     144.1
w199     1479     6963       14    2493.5    773.2    748.7      64.8
w333      736     3714      203   12813.4    574.7    573.5    6602.3
w967     2330     3662        8    1835.9    569.4    552.3      20.6



Tables 8.5 through 8.8 are divided into two parts: the first half provides information
on the number of documents and elements in the extent of the p* SD nodes being refined
(#Docs and #Elems columns, respectively), and the second half contains numbers
relative to the p*|c* refinement itself. The numbers under #Docs indicate how many
documents need to be opened to evaluate the refinement. The number of new SD nodes
created by the refinement (which is the same as the number of EEs evaluated) is
reported in the #EEs column. For instance, the p*|c* refinement partitions node r449
into 201 new SD nodes, which means that 201 XPath expressions have to be evaluated
in 6509 documents in order to obtain the p*|c* refinement of node r449. In general,
refinement times increase proportionally to the number of documents that need to be
opened for computing the refinement.

We consider two scenarios: one in which extents are materialized in the ElemDB table
(reported under columns V, M and P), and another in which the extents are virtual and
are thus represented only by the EEs (reported under column X). Times reported in the
V, M and P columns comprise locating the affected files using the SD, opening them,
and evaluating the EE in order to update the materialized extent information in the
ElemDB table. In addition to extent updates, the times in columns V and M include edge
computations using Algorithms 7.3 and 7.2, respectively. (The labels V and M, which
stand for "virtual" and "materialized", refer only to the different approaches to edge
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Table 8.9: RSS2 AxPRE refinements 



                                                        Resulting Extent   Times (s)
Node   Refining AxPRE                                    #Docs    #Elems       P
r468   c[title].fs[url].fs[link].fs[width]
         .fs[height].fs[description]                       172       172      3.9
r449   c[enclosure].fs[enclosure].fs[enclosure]              9        37     10.8
r653   fc[p].ns[p].ns[img]                                   6        26      0.4
r452   fs[link]                                            688     13885     10.1



computation). In contrast, times under the P column correspond to extent computation
only (without edges). Comparing column P against columns V and M gives us an idea
of how much overhead DescribeX incurs for the edges. Finally, the X column displays
how long it takes just to obtain the expressions for both edges and extents under the
virtual approach. Thus, the X column corresponds to a "purely virtual" approach in
which no materialization is used for either edges or extents. Since edges are computed
from the EEs, the SD graph is still maintained.

The time differences between the V and M columns come from the fact that computing
the edges between the new SD nodes using XPath is usually more costly than
computing them from the information stored in the ElemDB table. However, we are not
aware of any technique for computing general XPath expressions from the region
encodings in the ElemDB table, so using just the materialized extents is not always
possible.



Tables 8.9 through 8.12 report refinements that were chosen to study SDs involving
novel axes (e.g., fc, fs, ns) and more expressive AxPREs with label predicates. The
tables show the refining AxPRE for each p* SD node, the number of documents and
elements that contain neighbourhoods matching the entire AxPRE (#Docs and #Elems
columns, respectively), together with how long it takes to compute the extent
(Times column). For any given expression, the number of elements with either empty
neighbourhoods or matching prefixes of the AxPRE is the complement of the number
reported under #Elems. For instance, the r449 row of Table 8.9 indicates that 37
elements in 9 documents have exact c[enclosure].fs[enclosure].fs[enclosure]
neighbourhoods and obtaining them from the r449 extent takes 10.8 seconds. In addition,
we know that the number of elements either matching prefixes or with empty
neighbourhoods is 90546, which comes from the number in column #Elems and row r449 in
Table 8.5 (90583) minus the number in column #Elems and row r449 in Table 8.9 (37).
Such a subtraction would not be meaningful for the #Docs columns because the same
document may contain elements in different extents (remember that an SD contains a
partition of elements, not documents, so document extents may overlap).

Table 8.10: PSIMI2 AxPRE refinements

                                            Resulting Extent   Times (s)
Node   Refining AxPRE                        #Docs    #Elems       P
p59    c[name].fs[cellType]                      2        14     27.3
p18    c[primaryRef].fs[secondaryRef]           12        20     29.9
p24    c[name].ns[cellType].ns[tissue]           4         4     27.1
p193   c[name].ns[xref]                          4        28      3.4

Table 8.11: Wiki5 AxPRE refinements

                                                       Resulting Extent   Times (s)
Node   Refining AxPRE                                   #Docs    #Elems       P
w372   c[caption].c[collectionLink]
         .fs[br].fs[collectionLink]                         2         2      1.9
w199   c[sub].c[sub].fs[sub]                                1         1      2.1
w333   c[title].fs[p].fs[p].fs[p]                          39        79      1.1
w967   c[br].fs[collectionLink].fs[collectionLink]          4         6      1.2

Table 8.12: Wiki45 AxPRE refinements

                                                       Resulting Extent   Times (s)
Node   Refining AxPRE                                   #Docs    #Elems       P
w372   c[caption].c[collectionLink]
         .fs[br].fs[collectionLink]                         3         3     33.1
w199   c[sub].c[sub].fs[sub]                                3         3     39.0
w333   c[title].fs[p].fs[p].fs[p]                         155       320     28.2
w967   c[br].fs[collectionLink].fs[collectionLink]          9        11     57.3

These results suggest that, even though computing generic refinements like p*|c* may
be expensive, more specific refinements can be performed in less than a minute, and
many of them in just a few seconds for the smaller test collections.

8.3 Edge stabilization 

In this section, we report experimental results for stabilization of SD edges from our 
selected p* nodes. 



Tables 8.13 through 8.16 report edge stabilization times and extent sizes for the
selected SD nodes. The edge stabilized is indicated in the tables by an AxPRE containing
the axis and the label of the target node. The four Resulting Extents columns show
the number of documents and elements that contain the edge and the number of those
that do not. The times reported under columns V and M correspond to the materialized
extent approach with edge computation using EEs (the former) and the ElemDB table
(the latter), as explained in the previous section for refinements.

We stabilize two different edges for some p* SD nodes. After one edge stabilization,
the resulting SD node that does not have the stabilized edge is indicated by the SID
with an apostrophe. The second edge stabilized always corresponds to a node with an
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Table 8.13: RSS2 edge stabilization 



                                  Resulting Extents
                            With Edge        Without Edge      Times (s)
Node    Edge Stabilized   #Docs   #Elems    #Docs   #Elems      V      M
r468    c[description]      492      492     2804     2804     4.1    0.5
r468'   c[link]            2792     2792       12       12     4.5    0.2
r449    ps[item]           6263    84063     6509     6520    12.5    2.9
r449'   c[body]              15       15     6494     6505    10.9    3.7
r653    d[img]               12       12       10      201     0.5    0.3
r653'   d[table]              7        7        3       14     0.4    0.2
r452    c[br]                12       12     6249    81968    12.4    5.9



Table 8.14: PSIMI2 edge stabilization

                                  Resulting Extents
                            With Edge        Without Edge      Times (s)
Node    Edge Stabilized   #Docs   #Elems    #Docs   #Elems      V      M
p59     c[cellType]           4       28      156    24228    38.2    9.8
p18     c[secondaryRef]      12       20      148     2052    25.6    1.2
p24     c[tissue]             8       84      156     1988    25.4    1.3
p24'    c[cellType]           8      548      156     1440    23.4    1.2

Table 8.15: Wiki5 edge stabilization

                                    Resulting Extents
                              With Edge        Without Edge      Times (s)
Node    Edge Stabilized     #Docs   #Elems    #Docs   #Elems      V      M
w372    d[collectionLink]     335      592      695     1574    34.7    2.3
w372'   d[small]                3        5      694     1569    33.9    2.2
w199    c[sub]                 28       33     1469     6930    41.7    5.4
w199'   c[small]               18       83     1454     6847    43.7    5.4
w333    c[outsideLink]         34       83      724     3631    31.9    3.6
w333'   c[unknownLink]         68      131      705     3500    29.6    3.8
w967    c[template]            26       27     2304     3635    61.2    3.5
w967'   c[sup]                174      246     2130     3389    60.2    3.4



apostrophe from the previous stabilization. For instance, the first edge stabilized from
node r449 (Table 8.13) was the ps edge to an item node, which resulted in two SD nodes:
one containing a stable ps edge with 84063 elements in its extent, and another one
(r449') with no edge and 6520 elements. From node r449' we then stabilize the c edge to
a body node, again obtaining two nodes: one with a stable c edge with 15 elements in
its extent, and the other with 6505 elements and no edge. The time for computing the
ps edge stabilization is 12.5 seconds when computing the edges with EEs, and 2.9
seconds when using the ElemDB table. The times for the c edge stabilization are 10.9
and 3.7 seconds, respectively.

Our results show that DescribeX can provide interactive response times (from
sub-second to just a few seconds) for all edge stabilizations tested when using the
materialized approach for both extents and edges. Moreover, when using the more
expensive EE-based approach for finding the SD edges, we still obtain response times in
the order of a minute in the vast majority of test cases. This is compelling evidence
that DescribeX can be used in scenarios in which SDs need to be manipulated
interactively in order to selectively
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Table 8.16: Wiki45 edge stabilization 






                                    Resulting Extents
                              With Edge        Without Edge      Times (s)
Node    Edge Stabilized     #Docs   #Elems    #Docs   #Elems      V      M
w372    d[collectionLink]     125      207      169      315     2.4    0.6
w372'   d[small]                2        4      169      311     2.2    0.5
w199    c[sub]                  3        3      462     2191     2.7    0.8
w199'   c[small]                5       35      458     2156     2.6    0.8
w333    c[outsideLink]          7       12      126      488     1.7    0.7
w333'   c[unknownLink]         10       14      123      474     1.4    0.6
w967    c[template]             4        5      151      236     1.2    0.6
w967'   c[sup]                 66      123       85      113     1.4    0.5



explore the structure of an XML collection (e.g., aggregating thousands of RSS feeds
from dozens of content providers).



8.4 XPath query evaluation using SDs 

In this section, we provide performance results for obtaining answer documents for several
XPath queries using a variety of SDs. These results considerably expand the preliminary
study presented in [CR07].

Tables 8.17, 8.18, and 8.19 show the twelve queries in our benchmark (the structural
subqueries appear in black, the non-structural predicates are in grey). These queries
were selected to show how the system scales with respect to key query parameters like
answer size and number of candidate documents (those that provide a non-empty answer
for the structural subquery). Our benchmark focuses on the navigational features of
XPath, following the approach of the MemBeR XQuery Micro-Benchmark [AMM05],
which provides a form of standardization for studying XQuery evaluation. 
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Table 8.17: RSS collection queries 



Query   XPath Expression

R1      /rss/channel/image[title/following-sibling::url/following-sibling::
          link/following-sibling::width/following-sibling::height
          /following-sibling::description][width < height]

R2      /rss/channel/item[enclosure][enclosure/following-sibling::enclosure
          /following-sibling::enclosure][enclosure/@type='audio/mpeg']

R3      /rss/channel/item/body[child::*[1][self::p]/following-sibling::*[1]
          [self::p]/following-sibling::*[1][self::img]][img[width=height]]

R4      /rss/channel/item/description[following-sibling::link]
          [contains(., '2005')]



Table 8.18: PSIMI collection queries

Query   XPath Expression

P1      /entrySet/entry/interactorList/interactor/organism[names
          /following-sibling::cellType][contains(., 'Cercopithecus')]

P2      /entrySet/entry/experimentList/experimentDescription/bibref/xref
          [primaryRef/following-sibling::secondaryRef]
          [secondaryRef/@refType='method reference']

P3      /entrySet/entry/experimentList/experimentDescription
          /hostOrganismList/hostOrganism[child::names/following-sibling::*[1]
          [self::cellType]/following-sibling::*[1][self::tissue]]
          [tissue[contains(., 'endothelium')]]

P4      /entrySet/entry/interactorList/interactor/organism/cellType[names
          /following-sibling::*[1][self::xref]][contains(., 'Cercopithecus')]
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Table 8.19: Wikipedia collections queries 



Query   XPath Expression

W1      /article/body/section/section/section/figure[caption/collectionlink
          /following-sibling::br/following-sibling::collectionlink]
          [contains(., 'Loutherbourg')]

W2      /article/body/section/p/sub[child::sub/child::sub
          /following-sibling::sub][sub/sub='2']

W3      /article/body/section/section/section/section[child::title
          /following-sibling::p/following-sibling::p/following-sibling::p]
          [contains(., 'extinction')]

W4      /article/body/template/template/wikipedialink[following-sibling::
          collectionlink][contains(., 'William de Longespee')]



Tables 8.20 through 8.23 show the times for obtaining the answer documents and
evaluating the queries in our collections using a variety of SDs. The SD column indicates
the type of SD used to obtain the candidate documents (next column) on which the entire
query is evaluated. The three columns under Answer show the time it takes to evaluate
the query in the candidate documents, and the number of documents and elements in
the final answer. Since these are XPath queries, the number of documents and elements
returned by each query is independent of the SD used for evaluation.

Each row of SD, #Candidate Docs and Times corresponds to a different SD
used for evaluating the query. The "label" row in each section shows the evaluation
times when using the label SD node corresponding to the element returned by the query.
For instance, query R2 returns "item" elements, so the extent documents used are those
from the "item" node in the label SD (8122 documents in total), taking 19.9 seconds
to evaluate the query on them. The p* rows report the respective numbers when using
the p* node whose AxPRE contains the query (note that the SIDs from Tables 8.2, 8.3
and 8.4 are indicated). For instance, for query R2 we use node r449 from the p* SD,
taking 15.1 seconds to evaluate the query on the 6509 documents in the extent of r449.
Similarly, p*|c* rows show the evaluation times when using p*|c* SD nodes (there may be
more than one containing the query). For instance, the p*|c* node(s) used for query R2
have 181 documents and evaluating R2 on them takes 1.2 seconds. Finally, the last row
in each section, labeled "specific", shows DescribeX performance when using an AxPRE
refinement obtained from the structural subquery. For instance, for query R2 the refining
AxPRE would be c[enclosure].fs[enclosure].fs[enclosure] (row r449 in Table 8.9), which
has 9 documents in its extent, and evaluating R2 on them takes just 0.1 seconds. This is
the AxPRE we obtain by adapting the SD to R2.

Table 8.20: RSS2 query results and times

                      Candidate              Answer
Query   SD              #Docs     Times (s)   #Docs   #Elems
R1      label            3518        7.7         79       79
        p* (r468)        3296        7.4
        p*|c*             387        1.3
        specific          172        0.6
R2      label            8122       19.9          6       32
        p* (r449)        6509       15.1
        p*|c*             181        1.2
        specific            9        0.1
R3      label              31        0.4          6       26
        p* (r653)          18        0.3
        p*|c*              15        0.3
        specific            6        0.1
R4      label            8221       19.7        241     1344
        p* (r452)        6253       14.1
        p*|c*            6253       14.6
        specific          688        2.0

Table 8.21: PSIMI2 query results and times

                      Candidate              Answer
Query   SD              #Docs     Times (s)   #Docs   #Elems
P1      label             156       45.9          2       14
        p* (p59)          156       45.7
        p*|c*             156       45.7
        specific            2        2.5
P2      label             156       45.7          4        8
        p* (p18)          156       45.5
        p*|c*              12       17.1
        specific           12       17.1
P3      label             156       45.2          4        4
        p* (p24)          156       44.9
        p*|c*               6        6.5
        specific            4        5.8
P4      label               8        9.8          1        1
        p* (p193)           4        4.9
        p*|c*               4        4.8
        specific            4        4.9
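
The role of the SD in these experiments is to narrow the set of candidate documents before the full query is evaluated. The Python sketch below illustrates that idea on a toy collection; it is not the DescribeX implementation. It assumes xml.etree.ElementTree (which supports only a subset of XPath, so the query is a simplified, root-relative variant of R2), and the names evaluate_on_candidates, DOCS, LABEL_EXTENT and SPECIFIC_EXTENT, as well as the documents themselves, are hypothetical.

```python
import xml.etree.ElementTree as ET


def evaluate_on_candidates(docs, candidate_ids, query):
    """Evaluate `query` only on the candidate documents.

    Returns (answers, opened): answers maps docID -> number of matching
    elements; opened is how many documents had to be parsed.
    """
    answers = {}
    opened = 0
    for doc_id in sorted(candidate_ids):
        root = ET.fromstring(docs[doc_id])
        opened += 1
        hits = root.findall(query)
        if hits:
            answers[doc_id] = len(hits)
    return answers, opened


# Toy collection: docID -> XML text.
DOCS = {
    0: "<rss><channel><item><enclosure/><enclosure/></item><item/></channel></rss>",
    1: "<rss><channel><item/></channel></rss>",
    2: "<rss><channel><image/></channel></rss>",
}
QUERY = "./channel/item[enclosure]"   # items with an enclosure child
LABEL_EXTENT = {0, 1}     # documents containing any item element
SPECIFIC_EXTENT = {0}     # documents whose items have an enclosure child
```

Both extents yield the same answer, since the query result does not depend on the SD, but the more specific extent requires opening fewer documents, which is where the time savings in the tables come from.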

Not surprisingly, our results indicate that query evaluation performance gains are
heavily dependent on both the query and the collection. In some cases, just having the
label SD is description enough and provides good performance, whereas the label SD is
not of much help in others. For instance, using the most specific SD for PSIMI2 query
P4 (Table 8.21) only reduces query evaluation time by less than 50% over the label SD.
At the other end of the spectrum, using the most specific SD for query W1 on Wiki45
(Table 8.23) produces a performance improvement of almost four orders of magnitude,
going from half an hour (label SD) to sub-second (specific SD) evaluation time. In that
same table, there are also cases (like query W3) in which a p* SD by itself provides a
big gain, whereas the most specific SD only brings a modest further improvement. In
contrast, query W4 gets the greatest gain from the most specific SD (over two orders of
magnitude against both the p* and the p*|c* SDs).

These results show that, even though creating the most refined SD is not always
valuable, having the right SD for the right query does have an important impact on
the overall performance, and DescribeX provides a powerful mechanism for defining and
creating them.

Table 8.22: Wiki5 query results and times

                      Candidate              Answer
Query   SD              #Docs     Times (s)   #Docs   #Elems
W1      label           13288       54.3          1        1
        p* (w372)         242        2.5
        p*|c*               5        0.2
        specific            2        0.2
W2      label            1336        5.9          1        1
        p* (w199)         463        2.2
        p*|c*               1        0.2
        specific            1        0.2
W3      label           25192       97.8          1        1
        p* (w333)         128        1.4
        p*|c*              92        1.1
        specific           39        0.6
W4      label            5370       25.7          1        1
        p* (w967)         155        1.4
        p*|c*             155        1.5
        specific            4        0.2



Table 8.23: Wiki45 query results and times

                      Candidate              Answer
Query   SD              #Docs     Times (s)   #Docs   #Elems
W1      label          182598     1775.8          1        1
        p* (w372)         898       30.9
        p*|c*               7        0.3
        specific            3        0.2
W2      label            7341      131.5          1        1
        p* (w199)        1479       37.8
        p*|c*              13        0.8
        specific            3        0.3
W3      label          459296     3541.0          1        1
        p* (w333)         736       25.8
        p*|c*             442       16.5
        specific          155        5.4
W4      label           61183      872.9          1        1
        p* (w967)        2330       65.1
        p*|c*            2330       67.3
        specific            9        0.4
Table 8.24: System comparison: SD graph construction times (s)

Collection  Size (MB)  DescribeX   XSum
XMark1            115       17.3   12.8
XMark5            580       60.8   62.2
XMark10          1150      118.1  122.1



8.4.1 Comparison with summary proposals 



The results in Tables 8.20 through 8.23 also provide a comparison with the summary
literature. Proposals like the 1-index [MS99], APEX [CMS02], the A(k)-index [KSBG02], and
the D(k)-index [QLO03] can provide, at best, a description equivalent to the p* SD and thus
a performance similar to that reported in the p* row of each query. The p*|c* rows give
an indication of the performance provided by the F+B-Index [KBNK02]. DescribeX can
create SDs tailored to a workload that yield query evaluation times one to three orders
of magnitude faster than these proposals (last row of each query). Using a precise SD
can have a significant impact on the selection of both candidate and answer documents, and
thus on overall query evaluation. Note that no summary in the literature (even recent
proposals that cluster together nodes with the same subtree structure [BCF+05]) can
capture AxPREs such as c|fs.fs or fc.ns.

In addition, we compared DescribeX's initial construction time against an open-source
XML summarization tool, XSum [ABMP08], which constructs an annotated p* SD graph
(a dataguide). Table 8.24 shows comparable results for SD graph construction times
between DescribeX and XSum. We restricted the comparison to SD graph construction
times because XSum does not store either the materialized extents or the EEs; it only
creates a p* SD graph. To the best of our knowledge, this is the only structural
summarization system publicly available. Moreover, no other work in the extensive literature on
summaries [GW97] [MS99] [KBNK02] [KSBG02] [QLO03] [BCF+05] [PG06b] reports
construction times for their systems.
Since XSum can only summarize individual files, we were not able to test it with our
benchmark collections. Thus, we decided to do the comparative evaluation using the
XMark benchmark [BCF+03], which creates one single file of a chosen size.

These results show that DescribeX provides SD graph construction times comparable 
to an open-source structural summarization tool that is tailored to only one particular 
kind of SD (p*). 

8.4.2 Comparison with XPath evaluators 

We performed a comparative analysis against two DB systems, one commercial
(X-Hive/DB) and the other open source (XQuest DB). X-Hive/DB and XQuest
DB were selected because of their good performance in published XQuery benchmarks
[AFM06]. In addition, a comparison against a Saxon evaluation without summaries
is provided. Saxon was selected for being a popular processor that can also evaluate
XQuery and XSLT in a file-at-a-time fashion. Saxon is the XPath processor integrated
in DescribeX's default implementation (see Chapter 7), but for this comparison we
use the XPath processor stand-alone.

Keep in mind that the selected DB-like XML processors may have additional
functionality (such as transaction processing capabilities). The comparison aims to show that
the DescribeX architecture with the default implementation (combining summaries with
Saxon) can achieve results competitive with those of XML indexing engines, even with
gigabyte-sized collections. In addition, comparing against Saxon provides a performance
baseline for a file-at-a-time evaluation when the collection is stored as XML text files in
the file system and no summary structures are available. The results confirm that,
without summaries, Saxon itself lags by several orders of magnitude. We also tried to run our



X-Hive/DB: http://www.x-hive.com/products/db/
XQuest DB: http://www.axyana.com/xquest/
Saxon: http://saxon.sourceforge.net/
Table 8.25: RSS2 query evaluation comparative times (s)

Query  DescribeX  X-Hive  XQuest  Saxon
R1           0.6     7.9     3.1     91
R2           0.1     7.4     2.9     93
R3           0.1     7.5     2.0     92
R4           2.0     7.6  2.6(*)     92



Table 8.26: Wiki5 query evaluation comparative times (s)

Query  DescribeX  X-Hive  XQuest  Saxon
W1           0.2    25.3    12.2    337
W2           0.2    26.6    15.7    342
W3           0.6    24.7  7.5(*)    354
W4           0.2    25.8  6.5(*)    350



queries on DB2 v9, but the version we currently have does not support the following-sibling
or preceding-sibling axes, so our benchmark queries could not be run on DB2.



Tables 8.25 and 8.26 report the times for selecting answer documents using DescribeX,
X-Hive/DB, XQuest DB, and Saxon (without summaries) on the RSS2 and Wiki5
collections, respectively. Comparative times for Wiki45 are not reported because neither
X-Hive/DB nor XQuest DB could load the entire collection. XQuest DB returned an
incorrect answer for some of the queries, which are marked with an asterisk. DescribeX
times cover selecting the answer documents and evaluating the entire query using the
most refined SD (i.e., the "specific" AxPRE refinements reported in Tables 8.20, 8.21
and 8.22). These times are obtained by adding up the times for getting the candidate
documents and the times for evaluating the entire query on them (using Saxon).
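
The two-phase, file-at-a-time evaluation described above (first select candidate documents, then evaluate the query only on them) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the DescribeX implementation: Python's `xml.etree.ElementTree` stands in for Saxon, a simple label-to-documents map stands in for SD extents, and the `collection` contents and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory collection: document id -> XML text.
collection = {
    "d1": "<rss><channel><item><title>a</title></item></channel></rss>",
    "d2": "<rss><channel><title>b</title></channel></rss>",
    "d3": "<feed><entry><title>c</title></entry></feed>",
}


def build_label_extents(docs):
    """Map each element label to the set of documents that contain it
    (a crude stand-in for the extents of a label SD)."""
    extents = {}
    for doc_id, text in docs.items():
        for elem in ET.fromstring(text).iter():
            extents.setdefault(elem.tag, set()).add(doc_id)
    return extents


def evaluate(docs, extents, required_label, path):
    """Phase 1: pick candidate documents from the extents.
    Phase 2: run the path query only on those candidates."""
    candidates = sorted(extents.get(required_label, set()))
    answers = {}
    for doc_id in candidates:
        root = ET.fromstring(docs[doc_id])
        matches = root.findall(path)
        if matches:
            answers[doc_id] = [m.text for m in matches]
    return candidates, answers


extents = build_label_extents(collection)
candidates, answers = evaluate(collection, extents, "item", "./channel/item/title")
print(candidates, answers)
```

Choosing a more selective label for phase 1 ("item" rather than "title") shrinks the candidate set without changing the answer, which mirrors, in a very simplified form, why a more refined SD reduces the total time.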



DB2: http://www-306.ibm.com/software/data/db2/9/
The extensive empirical study presented here shows that DescribeX's file-at-a-time
XPath evaluation architecture can be a competitive alternative (in terms of query
response times) to DB-like XML query engines, even on gigabyte-sized collections. Our
experimental results also demonstrate that DescribeX's powerful mechanism for
adapting summaries to a workload can provide speedups of one to three orders of magnitude
compared to other proposals.



Chapter 9 



Conclusion 



This thesis focuses on addressing the need to describe the actual heterogeneous structure 
of web collections of XML documents. Understanding the metadata structure of such 
collections is fundamental for writing meaningful XPath queries and evaluating them 
efficiently. We propose a novel framework for describing the structure of a web collection
based on highly customizable summaries that can be conveniently tailored by axis path
regular expressions (AxPREs).

Our main results demonstrate the scalability of the AxPRE summary refinement and 
stabilization (the key enablers for tailoring summaries) using gigabyte XML collections. 
In addition, DescribeX's powerful mechanism for adapting summaries to a workload 
can provide speedups of one to three orders of magnitude compared to other proposals. 
The experiments also show that DescribeX's file-at-a-time XPath evaluation
architecture (supporting fast evaluation of complex XPath workloads over large web document
collections) can be a very competitive alternative (in terms of query response times) to
DB-like XML query engines, even on gigabyte-sized collections.

Familiar research issues can be revisited in the context of AxPRE summaries, such as
providing guidelines for selecting good summaries (similar to schema design) and
inferring general and succinct AxPRE expressions from an XML collection (similar to DTD
inference from instances). Developing tools for metadata management is also addressed
by a recent schema summarization proposal [YJ06]. In this direction, creating summaries
that describe how metadata labels (including some generated using schema abstraction
and summarization techniques) are used in a given instance seems promising.

In the context of XML messaging, we came across the problem of doing schema
mapping when the schemas are too general and only very small subsets are normally
used. The schema mapping problem consists of defining correspondences between two
schemas in order to translate data from one to the other [PVM+02]. If we need to define
a complete mapping between two very lax, broad schemas, we will end up with a large
number of correspondences that are irrelevant for any single instance. An interesting
research direction would be to develop a strategy to do summary mapping in the same
spirit as schema mapping, perhaps using EE definitions to create the correspondences
in XPath. Another option would be to use DescribeX summaries to determine which
schema elements do not apply to a given collection and then only define correspondences
for those elements that are actually used. This would significantly reduce the number
of correspondences needed to define a meaningful mapping, hence simplifying the overall
data translation process.
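
The second option (defining correspondences only for schema elements that are actually used) can be illustrated with a minimal Python sketch. It is not a DescribeX feature: the set of element labels found in the instances plays the role of a label summary, and the correspondence table, the element names, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def used_labels(xml_texts):
    """Collect every element label that actually occurs in the instances."""
    labels = set()
    for text in xml_texts:
        labels.update(elem.tag for elem in ET.fromstring(text).iter())
    return labels


def prune_correspondences(correspondences, source_instances):
    """Keep only correspondences whose source element occurs in the instances."""
    used = used_labels(source_instances)
    return {src: tgt for src, tgt in correspondences.items() if src in used}


# Hypothetical correspondences between a broad source schema and a target schema.
correspondences = {
    "Invoice": "Bill",
    "BuyerParty": "Customer",
    "TaxTotal": "Tax",
    "DeliveryTerms": "Shipping",
}
instances = ["<Invoice><BuyerParty/><TaxTotal/></Invoice>"]
print(prune_correspondences(correspondences, instances))
```

In this toy collection the `DeliveryTerms` correspondence is dropped because no instance uses that element; a more refined summary than a set of labels would allow pruning by structural context as well.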

The notion of bisimulation originated in fields other than databases (concurrency 
theory, verification, modal logic, set theory), where it continues to find applications. It 
would be interesting to explore whether the more flexible notion introduced in this thesis 
(selective bisimilarity applied to subgraphs described by AxPREs) can also find novel 
applications in such areas. 

Since this XPath-to-AxPRE syntactic translation can be applied to any XPath query,
it can also be used to translate XPlainer queries [CLR07] to AxPREs. XPlainer
expressions have the same syntax as XPath but a different semantics, which provides an
explanation in the form of the intermediate nodes, a kind of data provenance of the
answer.
Open research issues also include creating AxPREs for the XPlainer expressions of
a query, so that DescribeX can adapt SDs to accelerate the retrieval of intermediate
nodes. In addition, we plan to study the impact of adjusting the workload (e.g., by finding
frequent patterns), and also how to optimize SD selection given budget constraints. There
are also opportunities for exploiting the flexibility available in AxPRE summaries in the
context of the more traditional summary applications to indexing, selectivity estimation,
and query optimization.



Bibliography 



[ABMP07] Andrei Arion, Veronique Benzaken, Ioana Manolescu, and Yannis
Papakonstantinou. Structured materialized views for XML queries. In Proceedings of
the 33rd International Conference on Very Large Data Bases, 2007.

[ABMP08] Andrei Arion, Angela Bonifati, Ioana Manolescu, and Andrea Pugliese. Path
summaries and path partitioning in modern XML databases. World Wide
Web, 11(1):117-151, 2008.

[ACKR08] M. S. Ali, Mariano P. Consens, Shahan Khatchadourian, and Flavio Rizzolo. 
DescribeX: interacting with AxPRE summaries. In Proceedings of the 24th 
International Conference on Data Engineering (Demonstration), 2008. 

[ADR+04] Giuseppe Amato, Franca Debole, Fausto Rabitti, Pasquale Savino, and Pavel 
Zezula. A signature-based approach for efficient relationship search on XML 
data collections. In Second International XML Database Symposium, XSym, 
pages 82-96, 2004. 

[AFM06] Loredana Afanasiev, Massimo Franceschet, and Maarten Marx. XCheck: 
a platform for benchmarking XQuery engines. In Proceedings of the 32nd 
International Conference on Very Large Data Bases, pages 1247-1250, 2006. 

[AKJP+02] Shurug Al-Khalifa, H. V. Jagadish, Jignesh M. Patel, Yuqing Wu, Nick
Koudas, and Divesh Srivastava. Structural joins: A primitive for efficient
XML query pattern matching. In Proceedings of the 18th International
Conference on Data Engineering, pages 141-, 2002.

[AMM05] Loredana Afanasiev, Ioana Manolescu, and Philippe Michiels. MemBeR:
A micro-benchmark repository for XQuery. In Third International XML
Database Symposium, XSym, pages 144-161, 2005.

[BCF+03] Ralph Busse, Mike Carey, Daniela Florescu, Martin Kersten, Ioana
Manolescu, Albrecht Schmidt, and Florian Waas. XMark: An XML
benchmark project. http://www.xml-benchmark.org/, 2003.

[BCF+05] Peter Buneman, Byron Choi, Wenfei Fan, Robert Hutchison, Robert Mann,
and Stratis Viglas. Vectorizing and querying large XML repositories. In
Proceedings of the 21st International Conference on Data Engineering, pages
261-272, 2005.

[BCM05] Attila Barta, Mariano P. Consens, and Alberto O. Mendelzon. Benefits 
of path summaries in an XML query optimizer supporting multiple access 
methods. In Proceedings of the 31st International Conference on Very Large 
Data Bases, pages 133-144, 2005. 

[Ber94] Elisa Bertino. Index configuration in object-oriented databases. VLDB
Journal, 3(3):355-399, 1994.

[BFK05] Michael Benedikt, Wenfei Fan, and Gabriel M. Kuper. Structural properties 
of XPath fragments. Theoretical Computer Science, 336(1), 2005. 

[BKS02] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins:
Optimal XML pattern matching. In Proceedings of the 2002 ACM SIGMOD
International Conference on Management of Data, pages 310-321, 2002.

[BNST06] Geert Jan Bex, Frank Neven, Thomas Schwentick, and Karl Tuyls. Inference 
of concise DTDs from XML data. In Proceedings of the 32nd International 
Conference on Very Large Data Bases, pages 115-126, 2006. 

[BOB+04] Andrey Balmin, Fatma Ozcan, Kevin S. Beyer, Roberta Cochrane, and 
Hamid Pirahesh. A framework for using materialized XPath views in XML 
query processing. In Proceedings of the 30th International Conference on 
Very Large Data Bases, pages 60-71, 2004. 

[BW03] Michael Barg and Raymond K. Wong. A fast and versatile path index for
querying semi-structured data. In Proceedings of the 8th DASFAA, pages
249-256, 2003.

[CLR07] Mariano P. Consens, John W. Liu, and Flavio Rizzolo. XPlainer: Visual 
explanations of XPath queries. In Proceedings of the 23rd International 
Conference on Data Engineering, 2007. 

[CM94] Mariano P. Consens and Tova Milo. Optimizing queries on files. In
Proceedings of the 1994 ACM SIGMOD International Conference on Management
of Data, pages 301-312, 1994.

[CMS02] Chin-Wan Chung, Jun-Ki Min, and Kyuseok Shim. APEX: An adaptive path
index for XML data. In Proceedings of the 2002 ACM SIGMOD International
Conference on Management of Data, pages 121-132, 2002.

[CR07] Mariano P. Consens and Flavio Rizzolo. Fast answering of XPath query
workloads on web collections. In Fifth International XML Database
Symposium, XSym, 2007.

[CRV08] Mariano P. Consens, Flavio Rizzolo, and Alejandro A. Vaisman. AxPRE
summaries: Exploring the (semi-)structure of XML web collections. In
Proceedings of the 24th International Conference on Data Engineering, 2008.

[CSF+01] Brian F. Cooper, Neal Sample, Michael J. Franklin, Gisli R. Hjaltason, and
Moshe Shadmon. A fast index for semistructured data. In Proceedings of
the 27th International Conference on Very Large Data Bases, pages 341-350,
2001.

[CVZ+02] Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, and 
Carlo Zaniolo. Efficient structural joins on indexed XML documents. In 
Proceedings of the 28th International Conference on Very Large Data Bases, 
pages 263-274, 2002. 

[CYWY03] Jiefeng Cheng, Ge Yu, Guoren Wang, and Jeffrey Xu Yu. PathGuide: An 
efficient clustering based indexing method for XML path expressions. In 
Proceedings of the 8th DASFAA, pages 257-, 2003. 

[DG06] Ludovic Denoyer and Patrick Gallinari. The Wikipedia XML Corpus. SIGIR
Forum, 2006.

[DPGM04] Natasha Drukh, Neoklis Polyzotis, Minos N. Garofalakis, and Yossi Matias. 
Fractional XSKETCH synopses for XML databases. In Second International 
XML Database Symposium, XSym, pages 189-203, 2004. 

[DPP04] Agostino Dovier, Carla Piazza, and Alberto Policriti. An efficient
algorithm for computing bisimulation equivalence. Theoretical Computer
Science, 311(1-3):221-256, 2004.

[FGW+07] George H. L. Fletcher, Dirk Van Gucht, Yuqing Wu, Marc Gyssens, Sofia 
Brenes, and Jan Paredaens. A methodology for coupling fragments of XPath 
with structural indexes for XML documents. In Proceedings of the 11th In- 
ternational Symposium on Database Programming Languages, DBPL 2007, 
2007. 



[GGR+03] Minos Garofalakis, Aristides Gionis, Rajeev Rastogi, S. Seshadri, and
Kyuseok Shim. XTRACT: Learning document type descriptors from XML
document collections. Data Mining and Knowledge Discovery, 7(1):23-56,
2003.

[GKP03] Georg Gottlob, Christoph Koch, and Reinhard Pichler. XPath processing in 
a nutshell. SIGMOD Record, 32(1):12-19, 2003. 

[GKP05] Georg Gottlob, Christoph Koch, and Reinhard Pichler. Efficient algorithms 
for processing XPath queries. ACM Transactions on Database Systems 
(TODS), 30(2):444-491, 2005. 

[GW97] Roy Goldman and Jennifer Widom. Dataguides: Enabling query formulation 
and optimization in semistructured databases. In Proceedings of the 23rd 
International Conference on Very Large Data Bases, pages 436-445, 1997. 

[HU79] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory,
Languages and Computation. Addison-Wesley, 1979.

[HY04] Hao He and Jun Yang. Multiresolution indexing of XML for frequent queries.
In Proceedings of the 20th International Conference on Data Engineering,
pages 683-694, 2004.

[JLWO03] Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. XR-Tree: In- 
dexing XML data for efficient structural joins. In Proceedings of the 19th 
International Conference on Data Engineering, pages 253-263, 2003. 



[JWLY03] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig 
joins on indexed XML documents. In Proceedings of the 29th International 
Conference on Very Large Data Bases, pages 273-284, 2003. 




[KBNK02] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Ko- 
rth. Covering indexes for branching path queries. In Proceedings of the 2002 
ACM SIGMOD International Conference on Management of Data, pages 
133-144, 2002. 

[KBNS02] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Pradeep
Shenoy. Updates for structure indexes. In Proceedings of the 28th
International Conference on Very Large Data Bases, pages 239-250, 2002.

[KM90] Alfons Kemper and Guido Moerkotte. Advanced query processing in object 
bases using access support relations. In Proceedings of the 16th International 
Conference on Very Large Data Bases, pages 290-301, 1990. 

[KSBG02] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Ex- 
ploiting local similarity for indexing paths in graph-structured data. In Pro- 
ceedings of the 18th International Conference on Data Engineering, pages 
129-140, 2002. 

[KYU01] Dao Dinh Kha, Masatoshi Yoshikawa, and Shunsuke Uemura. An XML
indexing structure with relative region coordinate. In Proceedings of the
17th International Conference on Data Engineering, pages 313-320, 2001.

[LLCC05] Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, and Ting Chen. From region
encoding to extended Dewey: On efficient processing of XML twig pattern
matching. In Proceedings of the 31st International Conference on Very Large
Data Bases, pages 193-204, 2005.

[LM01] Quanzhong Li and Bongki Moon. Indexing and querying XML data for
regular path expressions. In Proceedings of the 27th International Conference
on Very Large Data Bases, pages 361-370, 2001.

[LM03] Quanzhong Li and Bongki Moon. Partition based path join algorithms for
XML data. In Proceedings of the 14th International Conference on Database
and Expert Systems Applications, DEXA 2003, pages 160-170, 2003.

[LS00] Hartmut Liefke and Dan Suciu. XMILL: An efficient compressor for XML
data. In Proceedings of the 2000 ACM SIGMOD International Conference
on Management of Data, pages 153-164, 2000.

[LWZ06] Laks V.S. Lakshmanan, Hui (Wendy) Wang, and Zheng (Jessica) Zhao. An- 
swering tree pattern queries using views. In Proceedings of the 32nd Inter- 
national Conference on Very Large Data Bases, 2006. 

[MdR05] Maarten Marx and Maarten de Rijke. Semantic characterizations of naviga- 
tional XPath. SIGMOD Record, 34(2):41-46, 2005. 

[MLMK05] Makoto Murata, Dongwon Lee, Murali Mani, and Kohsuke Kawaguchi. Tax- 
onomy of XML schema languages using formal language theory. ACM Trans- 
actions on Internet Technology (TOIT), 5(4):660-704, 2005. 

[MRV04] Alberto O. Mendelzon, Flavio Rizzolo, and Alejandro A. Vaisman. Indexing 
temporal XML documents. In Proceedings of the 30th International Confer- 
ence on Very Large Data Bases, pages 216-227, 2004. 

[MS99] Tova Milo and Dan Suciu. Index structures for path expressions. In
Proceedings of the 7th International Conference on Database Theory, pages 277-295,
1999.

[MS05] Bhushan Mandhani and Dan Suciu. Query caching and view selection for
XML databases. In Proceedings of the 31st International Conference on Very
Large Data Bases, pages 469-480, 2005.

[MS07] Anders Møller and Michael I. Schwartzbach. XML graphs in program
analysis. In Proc. ACM SIGPLAN Workshop on Partial Evaluation and Program
Manipulation, PEPM '07, 2007.

[MW95] Alberto O. Mendelzon and Peter T. Wood. Finding regular simple paths in
graph databases. SIAM Journal on Computing, 24(6):1235-1258, 1995.

[NUWC97] Svetlozar Nestorov, Jeffrey D. Ullman, Janet L. Wiener, and Sudarshan S.
Chawathe. Representative objects: Concise representations of
semistructured, hierarchical data. In Proceedings of the 13th International Conference
on Data Engineering, pages 79-90, 1997.

[ODPC06] Nicola Onose, Alin Deutsch, Yannis Papakonstantinou, and Emiran
Curtmola. Rewriting nested XML queries using nested views. In Proceedings of
the 2006 ACM SIGMOD International Conference on Management of Data,
2006.

[PG02] Neoklis Polyzotis and Minos N. Garofalakis. Statistical synopses for
graph-structured XML databases. In Proceedings of the 2002 ACM SIGMOD
International Conference on Management of Data, pages 358-369, 2002.

[PG06a] Neoklis Polyzotis and Minos N. Garofalakis. XCLUSTER synopses for struc- 
tured XML content. In Proceedings of the 22nd International Conference on 
Data Engineering, 2006. 

[PG06b] Neoklis Polyzotis and Minos N. Garofalakis. XSketch synopses for XML data
graphs. ACM Transactions on Database Systems (TODS), 31(3):1014-1063,
2006.

[PGI04] Neoklis Polyzotis, Minos N. Garofalakis, and Yannis E. Ioannidis.
Approximate XML query answers. In Proceedings of the 2004 ACM SIGMOD
International Conference on Management of Data, pages 263-274, 2004.

[PT87] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms.
SIAM Journal on Computing, 16(6):973-989, 1987.

[PVM+02] Lucian Popa, Yannis Velegrakis, Renee J. Miller, Mauricio A. Hernandez, 
and Ronald Fagin. Translating web data. In Proceedings of the 28th Inter- 
national Conference on Very Large Data Bases, 2002. 

[QLO03] Chen Qun, Andrew Lim, and Kian Win Ong. D(k)-index: An adaptive 
structural summary for graph-structured data. In Proceedings of the 2003 
ACM SIGMOD International Conference on Management of Data, pages 
134-144, 2003. 

[RM01] Flavio Rizzolo and Alberto O. Mendelzon. Indexing XML data with ToXin.
In Proceedings of 4th International Workshop on the Web and Databases,
pages 49-54, 2001.

[RM04] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using
Prüfer sequences. In Proceedings of the 20th International Conference on
Data Engineering, pages 288-300, 2004.

[RV08] Flavio Rizzolo and Alejandro A. Vaisman. Temporal XML: Modeling,
indexing, and query processing. The International Journal on Very Large Data
Bases, 2008.

[Sch04] Thomas Schwentick. XPath query containment. SIGMOD Record,
33(1):101-109, 2004.

[SCKT07] Reza Samavi, Mariano Consens, Shahan Khatchadourian, and Thodoros
Topaloglou. Exploring PSI-MI XML collections using DescribeX. Journal of
Integrative Bioinformatics, 4(3), 2007.

[SK85] Nicola Santoro and Ramez Khatib. Labelling and implicit routing in
networks. The Computer Journal, 28:5-8, 1985.

[Val87] Patrick Valduriez. Join indices. ACM Transactions on Database Systems
(TODS), 12(2):218-246, 1987.

[VMT04] Zografoula Vagena, Mirella Moura Moro, and Vassilis J. Tsotras. Efficient 
processing of XML containment queries using partition-based schemes. In 
Proceedings of the 8th International Database Engineering and Applications 
Symposium, IDEAS 2004, pages 161-170, 2004. 

[W3C99] W3C. XML Path Language (XPath) 1.0. http://www.w3.org/TR/xpath, 
1999. 

[W3C07] W3C. XML Path Language (XPath) 2.0. http://www.w3.org/TR/xpath20, 
2007. 

[WJLY03] Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey Xu Yu. PBiTree coding
and efficient processing of containment joins. In Proceedings of the 19th
International Conference on Data Engineering, pages 391-, 2003.

[WPFY03] Haixun Wang, Sanghyun Park, Wei Fan, and Philip S. Yu. ViST: A dynamic 
index method for querying XML data by tree structures. In Proceedings of 
the 2003 ACM SIGMOD International Conference on Management of Data, 
pages 110-121, 2003. 

[XH94] Zhaohui Xie and Jiawei Han. Join index hierarchies for supporting efficient
navigations in object-oriented databases. In Proceedings of the 20th
International Conference on Very Large Data Bases, pages 522-533, 1994.

[XO05] Wanhong Xu and Z. Meral Ozsoyoglu. Rewriting XPath queries using
materialized views. In Proceedings of the 31st International Conference on Very
Large Data Bases, pages 121-132, 2005.

[Yan90] Mihalis Yannakakis. Graph-theoretic methods in database theory. In Pro- 
ceedings of the 9th Symposium on Principles of Database Systems, pages 
230-242, 1990. 

[YHSY04] Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. Incremental maintenance of
XML structural indexes. In Proceedings of the 2004 ACM SIGMOD
International Conference on Management of Data, pages 491-502, 2004.

[YJ06] Cong Yu and H. V. Jagadish. Schema summarization. In Proceedings of the
32nd International Conference on Very Large Data Bases, pages 319-330,
2006.

[YLT03] Matthew Young-Lai and Frank Wm. Tompa. One-pass evaluation of region
algebra expressions. Information Systems, 28(3):159-168, 2003.

[ZKO04] Ning Zhang, Varun Kacholia, and M. Tamer Ozsu. A succinct physical 
storage scheme for efficient evaluation of path queries in XML. In Proceedings 
of the 20th International Conference on Data Engineering, 2004. 

[ZOIA06] Ning Zhang, M. Tamer Ozsu, Ihab F. Ilyas, and Ashraf Aboulnaga. FIX: 
Feature-based indexing technique for XML documents. In Proceedings of the 
32nd International Conference on Very Large Data Bases, 2006. 



Appendix A 



XPath 1.0 formal semantics 



We provide in this appendix a concise definition of the formal semantics of XPath 1.0
[W3C99]. Several semantic characterizations of XPath 1.0 have been proposed recently
[GKP03] [MdR05] [BFK05]. As part of the foundation of DescribeX, we have extended the
XPath formalization given in [GKP05] to better capture all the relevant constructs in the
standard. A significant addition to the rules is the proper treatment of the interaction
of parentheses followed by predicates. Parenthesis use in XPath does not just affect
precedence and grouping of operators; it does in fact change the semantics [CLR07].

Since XPath was designed to be embedded in other XML languages, it provides 
information about the context in which an expression will be evaluated. Given that 
XPath manipulates node sets, in addition to the node from which to start the evaluation, 
the context has to contain the node's position relative to a node set and the node set 
size. This node set could be the result of the evaluation of another XPath expression or 
a construct of the host language. 

Definition A.1 (Context) Let $\mathcal{A} = (Inst, Axes, Label, \lambda)$ be an axis graph,
$S \subseteq Inst$ and $v \in S$. The context of $v$ in $S$ with respect to $axis$ is defined as
$t = (v, pos_{axis}(v, S), |S|)$. We say that $v$ is the context node, $pos_{axis}(v, S)$ the context
position of $v$ in $S$ w.r.t. $\prec_{axis}$, and $|S|$ the context size. □
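
To make the context triple concrete, here is a small Python sketch. It assumes, following the XPath 1.0 recommendation, that positions are counted in document order for forward axes and in reverse document order for the four reverse axes; the `doc_order` map, the node names, and the function name `context` are hypothetical and only serve this illustration.

```python
# Reverse axes in XPath 1.0; every other axis is a forward axis.
REVERSE_AXES = {"ancestor", "ancestor-or-self", "preceding", "preceding-sibling"}


def context(v, node_set, axis, doc_order):
    """Return the context (v, pos_axis(v, S), |S|) of Definition A.1.

    `doc_order` maps each node to its document-order index (an assumption
    of this sketch); reverse axes count positions in reverse document order.
    """
    ordered = sorted(node_set, key=lambda u: doc_order[u],
                     reverse=axis in REVERSE_AXES)
    return (v, ordered.index(v) + 1, len(ordered))


doc_order = {"a": 0, "b": 1, "c": 2}   # hypothetical nodes in document order
node_set = {"a", "b", "c"}
print(context("a", node_set, "following-sibling", doc_order))
print(context("a", node_set, "preceding-sibling", doc_order))
```

The same node in the same node set gets a different context position depending on whether the axis is forward or reverse, which is why the definition is parameterized by the axis.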
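\[ \mathcal{E}[\![\mathit{locpath}]\!]((v,k,n)) := \mathcal{L}[\![\mathit{locpath}]\!](v) \tag{A.1} \]

\[ \mathcal{E}[\![\mathit{position}()]\!]((v,k,n)) := k \tag{A.2} \]

\[ \mathcal{E}[\![\mathit{last}()]\!]((v,k,n)) := n \tag{A.3} \]

\[ \mathcal{E}[\![\mathit{Op}(e_1, \ldots, e_m)]\!]((v,k,n)) := \mathcal{F}[\![\mathit{Op}]\!](\mathcal{E}[\![e_1]\!]((v,k,n)), \ldots, \mathcal{E}[\![e_m]\!]((v,k,n))) \tag{A.4} \]

Figure A.1: Semantic definitions of XPath expressions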

Each expression evaluates relative to a context and returns a value of one of four types:
number, node set, string and boolean. Other important XPath syntactic constructs are 
location paths, which are special cases of expressions. Location paths come in two flavors: 
absolute and relative. An absolute location path consists of / optionally followed by a 
relative location path. A relative location path consists of one or more location steps 
separated by /. (Since location steps are expressions, they also evaluate relative to a 
context.) 

We next define the formal semantics of XPath expressions, location paths, and operators
with functions $\mathcal{E}$, $\mathcal{L}$ and $\mathcal{F}$.

Definition A.2 (Semantic Functions $\mathcal{E}$, $\mathcal{L}$ and $\mathcal{F}$) Let $\mathit{Op}$ be a placeholder for
operators $\mathit{ArithOp} \in \{+, -, *, \mathit{div}, \mathit{mod}\}$, $\mathit{RelOp} \in \{=, \neq, \leq, <, \geq, >\}$,
$\mathit{EqOp} \in \{=, \neq\}$, and $\mathit{GtOp} \in \{\leq, <, \geq, >\}$. Let $e, e_1, \ldots, e_m$ be expressions and
$\mathit{locpath}, \mathit{locpath}_1, \ldots, \mathit{locpath}_m$ location paths. The semantics of XPath expressions
are defined by semantic functions $\mathcal{E}$ and $\mathcal{L}$ in Figures A.1 and A.2, and the semantics
of operators are defined by $\mathcal{F}$ in Figures A.3 and A.4. Function $\mathcal{E}$ defines the semantics
of expressions on a context, whereas function $\mathcal{L}$ defines the semantics of location paths
on a node. □

The distinction between context-based and node-based evaluation comes from the
fact that some functions like position() and last() need to be evaluated on a context
(they return the context position and the context size respectively). The evaluation of
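\[ \mathcal{L}[\![\mathit{locpath}_1 \mid \ldots \mid \mathit{locpath}_m]\!](v) := \bigcup_{i=1}^{m} \mathcal{L}[\![\mathit{locpath}_i]\!](v) \tag{A.5} \]

\[ \mathcal{L}[\![(\mathit{locpath})[e_1] \ldots [e_m]]\!](v) := \{ w \mid w \in S \wedge S = \mathcal{L}[\![\mathit{locpath}]\!](v) \wedge \bigwedge_{i=1}^{m} (\mathcal{E}[\![e_i]\!]((w, pos_{doc}(w, S), |S|)) = \mathit{true}) \} \tag{A.6} \]

\[ \mathcal{L}[\![\mathit{locpath}_1 / \mathit{locpath}_2]\!](v) := \bigcup_{w \in \mathcal{L}[\![\mathit{locpath}_1]\!](v)} \mathcal{L}[\![\mathit{locpath}_2]\!](w) \tag{A.7} \]

\[ \mathcal{L}[\![/\mathit{locpath}]\!](v) := \mathcal{L}[\![\mathit{locpath}]\!](v_0) \tag{A.8} \]

\[ \mathcal{L}[\![\mathit{axis} :: l[e_1] \ldots [e_m]]\!](v) := \{ w \mid w \in S \wedge S = \{ v' \mid (v, v') \in \mathit{axis} \wedge \lambda(v') = l \} \wedge \bigwedge_{i=1}^{m} (\mathcal{E}[\![e_i]\!]((w, pos_{axis}(w, S), |S|)) = \mathit{true}) \} \tag{A.9} \]

Figure A.2: Semantic definitions of XPath location paths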

location paths, on the other hand, requires only the context node. 
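
The interplay between the context-based and the node-based function can be sketched in Python as a tiny interpreter. This is an illustrative sketch under simplifying assumptions, not the formalization itself: it covers only rules (A.2), (A.3), (A.4) restricted to the "=" operator, (A.7) and (A.9); it supports only the forward axes child and descendant (so context positions follow document order); and the `Node` class, the tuple encoding of expressions, and the example tree are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    label: str
    children: list = field(default_factory=list)


def descendants(v):
    """All descendants of v in document order."""
    out = []
    for child in v.children:
        out.append(child)
        out.extend(descendants(child))
    return out


AXES = {"child": lambda v: list(v.children), "descendant": descendants}


def E(expr, ctx):
    """Context-based function: evaluate an expression on a context (v, k, n)."""
    _, k, n = ctx
    if expr == ("position",):      # rule (A.2)
        return k
    if expr == ("last",):          # rule (A.3)
        return n
    if expr[0] == "=":             # rule (A.4) with the "=" operator
        return E(expr[1], ctx) == E(expr[2], ctx)
    raise ValueError(f"unsupported expression: {expr!r}")


def L_step(axis, label, predicates, v):
    """Rule (A.9): one location step axis::label[e1]...[em] on node v."""
    S = [w for w in AXES[axis](v) if w.label == label]
    return [w for k, w in enumerate(S, start=1)
            if all(E(p, (w, k, len(S))) is True for p in predicates)]


def L_path(steps, v):
    """Rule (A.7): evaluate step1/step2/... starting from node v."""
    current = [v]
    for axis, label, predicates in steps:
        result = []
        for u in current:
            for w in L_step(axis, label, predicates, u):
                if w not in result:      # union of node sets
                    result.append(w)
        current = result
    return current


# Hypothetical tree: two expRoleLists holding three expRoles in total.
r1, r2, r3 = Node("expRole"), Node("expRole"), Node("expRole")
root = Node("root", [Node("expRoleList", [r1, r2]), Node("expRoleList", [r3])])
is_last = ("=", ("position",), ("last",))

all_roles = L_path([("descendant", "expRole", [])], root)
last_role = L_path([("descendant", "expRole", [is_last])], root)
last_per_list = L_path([("descendant", "expRoleList", []),
                        ("child", "expRole", [is_last])], root)
```

Note how the predicate position() = last() selects a single node when attached to the descendant step from the root, but one node per expRoleList when attached to a child step, because rule (A.9) builds a fresh node set S (and hence fresh contexts) for every node the step is evaluated on.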

Below we illustrate through a series of examples how these semantic functions are used
for evaluating XPath expressions. The examples cover the following four expressions: find
all expRoles, find the last expRole, find the first expRole of each expRoleList, and find
the first expRole in the entire collection. For these examples we use XPath abbreviated
syntax and the XML axis graph $\mathcal{A}$ of our running example.

Let us start with an expression with a single step that returns all expRoles in the 
collection. 

Example A.1 Let $t_0$ be the context $\langle v_0, 1, 1 \rangle$ and let

$$e_1 = descendant :: expRole$$

The evaluation of $e_1$ on $\mathcal{A}$ and $t_0$ returns all expRoles in the collection. In order to evaluate $e_1$ on $t_0$ we apply the semantic rules from Figures A.1 and A.2. Since $e_1$ is an expression containing a location path, the first rule we apply is (A.1), obtaining

$$\mathcal{E}\llbracket descendant :: expRole \rrbracket(t_0) := \mathcal{L}\llbracket descendant :: expRole \rrbracket(v_0)$$

Rule (A.1) translates the evaluation on the entire context $t_0 = \langle v_0, 1, 1 \rangle$ to an evaluation on just the context node $v_0$. Since $e_1$ consists of only one location step, we finish the evaluation by applying rule (A.9) with no predicates $[e_1] \dots [e_m]$ and get

$$\mathcal{L}\llbracket descendant :: expRole \rrbracket(v_0) := \{ w \mid w \in S \wedge S = \{ v \mid \langle v_0, v \rangle \in descendant \wedge \lambda(v) = expRole \} \}$$

which returns all $w$'s that are descendant expRoles of $v_0$. □
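The application of rule (A.9) in Example A.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the helper names and the three-expRole toy tree are hypothetical and are not part of DescribeX or of the running example's actual axis graph.

```python
class Node:
    """Toy element node; `label` plays the role of lambda(v)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def descendants(v):
    """Yield every v' with (v, v') in the descendant axis, in document order."""
    for child in v.children:
        yield child
        yield from descendants(child)

def step_descendant(v, label):
    """Rule (A.9) with no predicates, specialised to the descendant axis."""
    return [w for w in descendants(v) if w.label == label]

# Hypothetical axis graph: v0 is the root, with two expRoleLists below it.
v0 = Node("root", [
    Node("expRoleList", [Node("expRole"), Node("expRole")]),
    Node("expRoleList", [Node("expRole")]),
])
e1_result = step_descendant(v0, "expRole")
print(len(e1_result))  # number of expRole descendants of v0
```

With no predicates, the answer is exactly the set $S$ of rule (A.9): every node reachable through the axis that carries the requested label.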

Now, we consider a single step expression with a predicate that returns the last 
expRole in the collection. 

Example A.2 Let $t_0$ be the context $\langle v_0, 1, 1 \rangle$ and let

$$e_2 = descendant :: expRole[position() = last()]$$

The evaluation of $e_2$ on $\mathcal{A}$ and $t_0$ returns the last expRole in the collection. As in the previous example, the application of rule (A.1) transforms the evaluation on context $t_0 = \langle v_0, 1, 1 \rangle$ to an evaluation on node $v_0$. Since $e_2$ consists only of a location step, we apply rule (A.9) and get

$$\mathcal{L}\llbracket descendant :: expRole[position() = last()] \rrbracket(v_0) := \{ w_1 \mid w_1 \in S \wedge S = \{ v \mid \langle v_0, v \rangle \in descendant \wedge \lambda(v) = expRole \} \wedge t_1 = \langle w_1, pos_{descendant}(w_1, S), |S| \rangle \wedge \mathcal{E}\llbracket position() = last() \rrbracket(t_1) = true \}$$

which returns all $w_1$'s that are descendant expRoles of $v_0$ and satisfy the predicate position() = last(). In order to evaluate this predicate on each $w_1$, we go back to function $\mathcal{E}$ by invoking rule (A.4) of Figure A.1 with the "=" operator, the new context $t_1$ and $\mathcal{F}\llbracket = : num \times num \to bool \rrbracket$, obtaining

$$\mathcal{E}\llbracket (position() = last()) \rrbracket(t_1) := \mathcal{F}\llbracket = \rrbracket(\mathcal{E}\llbracket position() \rrbracket(t_1), \mathcal{E}\llbracket last() \rrbracket(t_1)) := \mathcal{E}\llbracket position() \rrbracket(t_1) = \mathcal{E}\llbracket last() \rrbracket(t_1)$$

This step of the evaluation checks whether each $w_1$ is in fact the last in the sequence of descendant expRoles by invoking $\mathcal{E}\llbracket position() \rrbracket(t_1)$ and $\mathcal{E}\llbracket last() \rrbracket(t_1)$ (rules (A.2) and (A.3), respectively) and comparing their returned values for equality. If they are equal for some $w_1$, the evaluation of the location step returns that $w_1$; otherwise it returns the empty node set. This completes the evaluation of $e_2$. □
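The role of the context triple in Example A.2 can be made concrete with a small Python sketch. It follows rule (A.9) as written in Figure A.2, evaluating every predicate against the same set $S$; the `Context` tuple, the helper names and the toy tree are assumptions made for illustration only.

```python
from typing import NamedTuple

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

class Context(NamedTuple):
    """Context triple: node, context position (1-based), context size."""
    node: Node
    position: int
    size: int

def descendants(v):
    """Descendant axis in document order (a forward axis, so pos_axis is the list index)."""
    for child in v.children:
        yield child
        yield from descendants(child)

def step(v, axis, label, predicates=()):
    """Rule (A.9): build S, then keep each w whose predicates all hold on its context."""
    S = [w for w in axis(v) if w.label == label]
    return [w for pos, w in enumerate(S, start=1)
            if all(p(Context(w, pos, len(S))) for p in predicates)]

def position_equals_last(t):
    """E[[position() = last()]](t), i.e. rules (A.2), (A.3) and (A.4) combined."""
    return t.position == t.size

last_role = Node("expRole")
v0 = Node("root", [
    Node("expRoleList", [Node("expRole"), Node("expRole")]),
    Node("expRoleList", [last_role]),
])
e2_result = step(v0, descendants, "expRole", [position_equals_last])
print(e2_result == [last_role])
```

The predicate never inspects the node itself; it only compares the position and size stored in the context, which is why it cannot be evaluated on a bare node.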

Next, we introduce composition with a more complex expression that returns the first
expRole of each expRoleList in the collection.

Example A.3 Let $t_0$ be the context $\langle v_0, 1, 1 \rangle$ and let

$$e_3 = descendant :: expRoleList / child :: expRole[1]$$

The evaluation of $e_3$ on $\mathcal{A}$ and $t_0$ returns the first expRole of each expRoleList in the collection. As in the previous example, the application of rule (A.1) transforms the evaluation on context $t_0 = \langle v_0, 1, 1 \rangle$ to an evaluation on node $v_0$. Since $e_3$ consists of a composition of a location step and a location path, we apply rule (A.7) and get

$$\mathcal{L}\llbracket descendant :: expRoleList / child :: expRole[1] \rrbracket(v_0) := \bigcup_{w_1 \in \mathcal{L}\llbracket descendant :: expRoleList \rrbracket(v_0)} \mathcal{L}\llbracket child :: expRole[1] \rrbracket(w_1)$$

which entails evaluating the location path on each of the $w_1$'s returned by the evaluation of the location step and taking the union of the results. In order to obtain those $w_1$'s we apply rule (A.9) to the location step descendant :: expRoleList, obtaining

$$\mathcal{L}\llbracket descendant :: expRoleList \rrbracket(v_0) := \{ w_1 \mid w_1 \in S \wedge S = \{ v \mid \langle v_0, v \rangle \in descendant \wedge \lambda(v) = expRoleList \} \}$$

which finishes the evaluation of the first part of the composition. Next we evaluate the location path by applying rule (A.9) and get

$$\mathcal{L}\llbracket child :: expRole[1] \rrbracket(w_1) := \{ w_2 \mid w_2 \in S \wedge S = \{ v \mid \langle w_1, v \rangle \in child \wedge \lambda(v) = expRole \} \wedge t_2 = \langle w_2, pos_{child}(w_2, S), |S| \rangle \wedge \mathcal{E}\llbracket position() = 1 \rrbracket(t_2) = true \}$$

which returns all $w_2$'s that are child expRoles of the $w_1$'s and satisfy the predicate [1] (which is [position() = 1] in unabbreviated syntax). In order to evaluate the predicate we invoke rule (A.4) of Figure A.1 with the "=" operator, the new context $t_2$ and $\mathcal{F}\llbracket = : num \times num \to bool \rrbracket$, and get

$$\mathcal{E}\llbracket (position() = 1) \rrbracket(t_2) := \mathcal{F}\llbracket = \rrbracket(\mathcal{E}\llbracket position() \rrbracket(t_2), 1) := \mathcal{E}\llbracket position() \rrbracket(t_2) = 1$$

This step of the evaluation checks whether $w_2$ is in fact the first in the sequence of child expRoles of $w_1$ by invoking $\mathcal{E}\llbracket position() \rrbracket(t_2)$ (rule (A.2)) and checking whether it returns 1. Since the predicate is part of the location path, it is evaluated on each of the $w_2$'s. Thus, the evaluation of $e_3$ will return one $w_2$ (the first one in the sequence) for each $w_1$. □

Finally, let us illustrate the impact of parentheses by considering an expression that
returns the first expRole in the entire collection.

Example A.4 Let $t_0$ be the context $\langle v_0, 1, 1 \rangle$ and let

$$e_4 = (descendant :: expRoleList / child :: expRole)[1]$$

The evaluation of $e_4$ on $\mathcal{A}$ and $t_0$ returns the first expRole in the entire collection. (Notice the difference with the previous example, which returns the first expRole of each expRoleList.) As before, we begin the evaluation by invoking rule (A.1). Then we apply rule (A.6) to the parenthesized expression, obtaining

$$\mathcal{L}\llbracket (descendant :: expRoleList / child :: expRole)[1] \rrbracket(v_0) := \{ w_2 \mid w_2 \in S \wedge S = \mathcal{L}\llbracket descendant :: expRoleList / child :: expRole \rrbracket(v_0) \wedge t_2 = \langle w_2, pos_{doc}(w_2, S), |S| \rangle \wedge \mathcal{E}\llbracket position() = 1 \rrbracket(t_2) = true \}$$

Now we have to evaluate the composition by invoking rule (A.7) and get

$$\mathcal{L}\llbracket descendant :: expRoleList / child :: expRole \rrbracket(v_0) := \bigcup_{w_1 \in \mathcal{L}\llbracket descendant :: expRoleList \rrbracket(v_0)} \mathcal{L}\llbracket child :: expRole \rrbracket(w_1)$$

which entails evaluating the location path on each of the $w_1$'s returned by the evaluation of the location step and taking the union of the results. Notice that, in contrast with the previous example, the predicate is applied to the result of the composition instead of being part of the location path. We continue by applying rule (A.9) to the location step descendant :: expRoleList in order to obtain the $w_1$'s to be used by the location path:

$$\mathcal{L}\llbracket descendant :: expRoleList \rrbracket(v_0) := \{ w_1 \mid w_1 \in S \wedge S = \{ v \mid \langle v_0, v \rangle \in descendant \wedge \lambda(v) = expRoleList \} \}$$

Next we invoke (A.9) to evaluate the location step child :: expRole and get

$$\mathcal{L}\llbracket child :: expRole \rrbracket(w_1) := \{ w_2 \mid w_2 \in S \wedge S = \{ v \mid \langle w_1, v \rangle \in child \wedge \lambda(v) = expRole \} \wedge t_1 = \langle w_2, pos_{child}(w_2, S), |S| \rangle \}$$

which returns all $w_2$'s that are child expRoles of the $w_1$'s. We then evaluate the predicate $\mathcal{E}\llbracket (position() = 1) \rrbracket(t_2)$ as before. However, one difference with the previous example is that here there is only one sequence of $w_2$'s (rather than one sequence for each $w_1$). That is the reason why the context in the evaluation of position() = 1 changed from $t_1$ (based on the child axis) to $t_2$ (based on the entire axis graph). Thus, the evaluation of $e_4$ returns only one node: the first expRole in the collection. □
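The difference between Examples A.3 and A.4 comes down to where the predicate [1] is applied, which the following Python sketch tries to make visible. It is an illustration under assumptions: the `Node` class, the node names `r1` to `r3`, and the helper functions are hypothetical, and the toy tree merely stands in for the axis graph of the running example.

```python
class Node:
    def __init__(self, label, name="", children=()):
        self.label, self.name, self.children = label, name, list(children)

def children(v):
    return list(v.children)

def descendants(v):
    for child in v.children:
        yield child
        yield from descendants(child)

def step(v, axis, label, position=None):
    """Rule (A.9) with an optional [position] predicate evaluated on pos_axis."""
    S = [w for w in axis(v) if w.label == label]
    return [w for pos, w in enumerate(S, start=1) if position in (None, pos)]

def union_in_doc_order(root, node_lists):
    """Rule (A.7): union of the per-node results, returned in document order."""
    picked = {id(w) for nodes in node_lists for w in nodes}
    return [w for w in descendants(root) if id(w) in picked]

v0 = Node("root", children=[
    Node("expRoleList", children=[Node("expRole", "r1"), Node("expRole", "r2")]),
    Node("expRoleList", children=[Node("expRole", "r3")]),
])
lists = step(v0, descendants, "expRoleList")

# e3: the predicate [1] belongs to the step, so it is applied once per expRoleList.
e3 = union_in_doc_order(v0, [step(w1, children, "expRole", position=1) for w1 in lists])

# e4: rule (A.6) applies [1] to the whole composition, using pos_doc over one set S.
S = union_in_doc_order(v0, [step(w1, children, "expRole") for w1 in lists])
e4 = [w for pos, w in enumerate(S, start=1) if pos == 1]

print([w.name for w in e3])  # ['r1', 'r3']
print([w.name for w in e4])  # ['r1']
```

In the first case positions are counted within each expRoleList, so every list contributes its own first child; in the second, positions are counted once over the whole result set.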
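\begin{align*}
\mathcal{F}\llbracket ArithOp : num \times num \to num \rrbracket(n_1, n_2) &:= n_1 \; ArithOp \; n_2\\
\mathcal{F}\llbracket \text{constant number } n : \to num \rrbracket() &:= n\\
\mathcal{F}\llbracket count : nset \to num \rrbracket(S) &:= |S|\\
\mathcal{F}\llbracket and : bool \times bool \to bool \rrbracket(b_1, b_2) &:= b_1 \wedge b_2\\
\mathcal{F}\llbracket or : bool \times bool \to bool \rrbracket(b_1, b_2) &:= b_1 \vee b_2\\
\mathcal{F}\llbracket not : bool \to bool \rrbracket(b) &:= \neg b\\
\mathcal{F}\llbracket true : \to bool \rrbracket() &:= true\\
\mathcal{F}\llbracket false : \to bool \rrbracket() &:= false\\
\mathcal{F}\llbracket boolean : nset \to bool \rrbracket(S) &:= \text{if } S \neq \emptyset \text{ then } true \text{ else } false\\
\mathcal{F}\llbracket boolean : str \to bool \rrbracket(s) &:= \text{if } s \neq \text{""} \text{ then } true \text{ else } false\\
\mathcal{F}\llbracket boolean : num \to bool \rrbracket(n) &:= \text{if } n \neq \pm 0 \text{ and } n \neq NaN \text{ then } true \text{ else } false\\
\mathcal{F}\llbracket EqOp : bool \times (str \cup num \cup bool) \to bool \rrbracket(b, x) &:= b \; EqOp \; \mathcal{F}\llbracket boolean \rrbracket(x)\\
\mathcal{F}\llbracket EqOp : num \times (str \cup num) \to bool \rrbracket(n, x) &:= n \; EqOp \; \mathcal{F}\llbracket number \rrbracket(x)\\
\mathcal{F}\llbracket EqOp : str \times str \to bool \rrbracket(s_1, s_2) &:= s_1 \; EqOp \; s_2\\
\mathcal{F}\llbracket RelOp : nset \times nset \to bool \rrbracket(S_1, S_2) &:= \exists v_1 \in S_1, v_2 \in S_2 : strval(v_1) \; RelOp \; strval(v_2)\\
\mathcal{F}\llbracket RelOp : nset \times num \to bool \rrbracket(S, n) &:= \exists v \in S : to\_number(strval(v)) \; RelOp \; n\\
\mathcal{F}\llbracket RelOp : nset \times str \to bool \rrbracket(S, s) &:= \exists v \in S : strval(v) \; RelOp \; s\\
\mathcal{F}\llbracket RelOp : nset \times bool \to bool \rrbracket(S, b) &:= \mathcal{F}\llbracket boolean \rrbracket(S) \; RelOp \; b\\
\mathcal{F}\llbracket GtOp : (str \cup num \cup bool) \times (str \cup num \cup bool) \to bool \rrbracket(x_1, x_2) &:= \mathcal{F}\llbracket number \rrbracket(x_1) \; GtOp \; \mathcal{F}\llbracket number \rrbracket(x_2)
\end{align*}

Figure A.3: Semantic definitions of XPath basic operators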
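A few of the definitions in Figure A.3 can be sketched in Python to show their less obvious consequences, in particular the existential reading of comparisons on node sets. This is a sketch under assumptions: a node set is modelled simply as a list of string values, and `to_number`, `boolean`, `rel_nset_num` and `rel_nset_str` are hypothetical helper names, not functions of any XPath library.

```python
import math
import operator
import re

# XPath 1.0 numbers: optional whitespace, optional minus, digits with optional fraction.
_XPATH_NUMBER = re.compile(r"[ \t\r\n]*-?(\d+(\.\d*)?|\.\d+)[ \t\r\n]*")

def to_number(s):
    """Convert a string to a number; anything that is not a number maps to NaN."""
    return float(s) if _XPATH_NUMBER.fullmatch(s) else math.nan

def boolean(x):
    """F[[boolean]] for booleans, numbers, strings and node sets (lists of strvals)."""
    if isinstance(x, bool):
        return x
    if isinstance(x, (int, float)):
        return x != 0 and not math.isnan(x)
    return len(x) > 0

def rel_nset_num(S, n, op):
    """F[[RelOp : nset x num -> bool]]: true if SOME node satisfies the comparison."""
    return any(op(to_number(sv), n) for sv in S)

def rel_nset_str(S, s, op):
    """F[[RelOp : nset x str -> bool]]: true if SOME node satisfies the comparison."""
    return any(op(sv, s) for sv in S)

days = ["Mon", "Tue"]
print(rel_nset_str(days, "Mon", operator.eq))  # True
print(rel_nset_str(days, "Mon", operator.ne))  # True as well: "Tue" differs from "Mon"
print(boolean(math.nan), boolean(""), boolean("false"))  # False False True
```

Because the comparison is existential, a node set can be both equal and not equal to the same string, which is why negation in XPath is usually written with not(...) rather than with the != operator.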
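\begin{align*}
\mathcal{F}\llbracket id : nset \to nset \rrbracket(S) &:= \bigcup_{v \in S} \mathcal{F}\llbracket id \rrbracket(strval(v))\\
\mathcal{F}\llbracket id : str \to nset \rrbracket(s) &:= deref\_ids(s)\\
\mathcal{F}\llbracket number : str \to num \rrbracket(s) &:= to\_number(s)\\
\mathcal{F}\llbracket number : bool \to num \rrbracket(b) &:= \text{if } b = true \text{ then } 1 \text{ else } 0\\
\mathcal{F}\llbracket number : nset \to num \rrbracket(S) &:= \mathcal{F}\llbracket number \rrbracket(\mathcal{F}\llbracket string \rrbracket(S))\\
\mathcal{F}\llbracket sum : nset \to num \rrbracket(S) &:= \sum_{v \in S} to\_number(strval(v))\\
\mathcal{F}\llbracket \text{constant string } s : \to str \rrbracket() &:= s\\
\mathcal{F}\llbracket string : num \to str \rrbracket(n) &:= to\_string(n)\\
\mathcal{F}\llbracket string : nset \to str \rrbracket(S) &:= \text{if } S = \emptyset \text{ then } \text{""} \text{ else } strval(first_{<_{doc}}(S))\\
\mathcal{F}\llbracket string : bool \to str \rrbracket(b) &:= \text{if } b = true \text{ then } \text{"true"} \text{ else } \text{"false"}
\end{align*}

Figure A.4: Semantic definitions of XPath additional operators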
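The conversion rules of Figure A.4 compose in ways that are easy to overlook: the number of a node set depends only on its first node in document order, whereas sum looks at every node. A minimal Python sketch, assuming a node set is modelled as a list of string values already in document order and using hypothetical helper names:

```python
import math
import re

_XPATH_NUMBER = re.compile(r"[ \t\r\n]*-?(\d+(\.\d*)?|\.\d+)[ \t\r\n]*")

def to_number(s):
    """String-to-number conversion; non-numeric strings map to NaN."""
    return float(s) if _XPATH_NUMBER.fullmatch(s) else math.nan

def string_of_nset(S):
    """F[[string : nset -> str]]: string value of the first node, or "" if S is empty."""
    return S[0] if S else ""

def number(x):
    """F[[number]] for booleans, strings and node sets."""
    if isinstance(x, bool):
        return 1.0 if x else 0.0
    if isinstance(x, str):
        return to_number(x)
    return to_number(string_of_nset(x))  # number(S) := number(string(S))

def sum_of_nset(S):
    """F[[sum : nset -> num]]: add up the numeric value of every node."""
    return sum(to_number(sv) for sv in S)

durations = ["10", "20"]
print(number(durations))       # 10.0: only the first node counts
print(sum_of_nset(durations))  # 30.0: every node counts
print(number([]))              # nan: the empty node set converts to "", then to NaN
```

A single non-numeric node turns the whole sum into NaN, while number(S) is unaffected unless that node happens to be the first one.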



Appendix B 



Declarative debugging of XPath queries with DescribeX



In this appendix, we present DescribeX-Eclipse, a visual interactive tool for exploring XML collections and explaining XPath queries. DescribeX-Eclipse is built on top of the DescribeX engine implementation presented in Chapter 7 and includes a GUI and additional XML retrieval tools implemented by other colleagues [ACKR08]. This visual tool provides additional evidence of the wide range of applications the DescribeX framework has in the area of XML processing.

DescribeX-Eclipse is written in Java for the Eclipse (http://www.eclipse.org/) plug-in framework and its existing tools, views, and editors. Eclipse is a popular open source platform built by an open community of tool providers. DescribeX-Eclipse is also integrated with the XPlainer-Eclipse plug-in [CLR07], and fully supports declarative debugging of any XPath engine, including implementation-dependent intermediate results. XPlainer-Eclipse extends the XML and XPath development facilities available in the Eclipse environment with the ability to support explanation queries. The DescribeX-Eclipse tool extends XPlainer-Eclipse with the structural description capabilities of the DescribeX framework.
Figure B.1: DescribeX-Eclipse user interface

DescribeX-Eclipse allows developers to navigate between different views of the local and global structure of large (multi-gigabyte) collections in order to obtain enough structural information for writing and debugging XPath queries. The graph-based visualization employed by DescribeX-Eclipse makes it straightforward to see the different path structures that are present in the collection. DescribeX-Eclipse helps a user quickly understand which parts of a collection schema (if present) are used in practice.

In order to explain how DescribeX-Eclipse works, let us go back to our developer, Sue, who is trying to aggregate podcasts that are part of a series. In a series feed, items are sorted from the newest (the first) to the oldest (the last). The feed may span several days or weeks, and there might be more than one item per day. In particular, Sue is now interested in items containing pubDate, link and enclosure elements. In addition, she aggregates the item(s) of the latest day in the series separately from the rest. To obtain all items that do not belong to the latest day, she runs the query

Q4 = /rss/channel[item[position()>1]]/item[link][enclosure]
     [not(pubDate=../item[1]/pubDate)]

which returns all items containing link, enclosure and pubDate from previous days. 
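To see what Q4 selects, its steps and predicates can be spelled out by hand over a small feed. The sketch below uses Python's standard `xml.etree.ElementTree` only for parsing and child lookup; the function `q4`, the helper `strvals` and the sample feed are hypothetical, and the feed uses date-only pubDate values as a simplifying assumption so that items from the same day compare equal.

```python
import xml.etree.ElementTree as ET

# Hypothetical series feed, newest item first; ep1 has no enclosure.
FEED = """<rss><channel>
  <item><title>ep3b</title><link>http://example.org/3b</link>
        <enclosure url="3b.mp3"/><pubDate>2008-03-03</pubDate></item>
  <item><title>ep3a</title><link>http://example.org/3a</link>
        <enclosure url="3a.mp3"/><pubDate>2008-03-03</pubDate></item>
  <item><title>ep2</title><link>http://example.org/2</link>
        <enclosure url="2.mp3"/><pubDate>2008-03-02</pubDate></item>
  <item><title>ep1</title><link>http://example.org/1</link>
        <pubDate>2008-03-01</pubDate></item>
</channel></rss>"""

def strvals(elem, tag):
    """String values of the children of `elem` named `tag`."""
    return {"".join(e.itertext()) for e in elem.findall(tag)}

def q4(root):
    """Hand-written evaluation of Q4, one predicate at a time."""
    answer = []
    if root.tag != "rss":
        return answer
    for channel in root.findall("channel"):
        items = channel.findall("item")
        if len(items) < 2:                      # channel[item[position()>1]]
            continue
        newest = strvals(items[0], "pubDate")   # ../item[1]/pubDate
        for item in items:
            has_link = item.find("link") is not None            # [link]
            has_enclosure = item.find("enclosure") is not None  # [enclosure]
            same_day = bool(strvals(item, "pubDate") & newest)  # pubDate = ...
            if has_link and has_enclosure and not same_day:
                answer.append(item)
    return answer

root = ET.fromstring(FEED)
print([item.findtext("title") for item in q4(root)])  # ['ep2']
```

Here ep3b and ep3a are dropped because they share the newest pubDate, and ep1 is dropped because it has no enclosure, which leaves ep2 as the only answer.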
Using DescribeX-Eclipse, Sue can create an SD like the one shown in the screenshot of Figure B.1. The screenshot shows seven views; the SD graph is displayed in the DescribeXEditor (view (1)) and in the outline (view (4)). The outline view shows the entire SD graph, with the fragment that appears in the DescribeXEditor highlighted in a light blue box.



The SD of Figure B.1 has a node for each item that has a different substructure. The edges represent c axis relations between elements. The fragment of the SD graph displayed in view (1) tells the developer that there are three kinds of items in the collection, each one containing a different combination of elements. For instance, the third item node in the SD (in yellow) has title, description, pubDate, duration, guid, enclosure and link (all in yellow) and represents all item elements in the collection with that particular structure.

Since the behavior of a query is instance-dependent, Sue needs to run it on all candidate documents in order to debug it effectively. A document is a candidate for a query Q when it returns a non-empty answer for the structural subquery of Q.

A visual explanation, which shows the XPath result and intermediate nodes, is provided by views (2) and (6). Given an XPath query and an input XML document, an explanation of the query gives as answer all the XPath result nodes together with the intermediate nodes. The intermediate nodes are those nodes resulting from the partial evaluation of the subexpressions of the original XPath query that contribute to the answer. Obtaining the explanation of a complex XPath query can be challenging, as shown in the example. Visual explanations provide a representation of the basic mechanism at play during XPath processing.

Views (2) and (6), together with view (5), correspond to the XPlainer-Eclipse plug-in [CLR07]. The XPlainer Editor (view (2)) shows one of the candidate documents with an explanation of Q4. View (5) displays the query and the number of elements in the answer. View (6) displays the XPlainer tree, a particular parse tree for the query that provides an intuitive representation of its structure. Each node in the XPlainer tree corresponds to one step or predicate in Q4. The intermediate document nodes of each step are identified by the same sequence number in both the XPlainer tree and the XPlainer Editor. For instance, item nodes (4.1), (4.2) and (4.3) (view (2)) are the answer of the query, which corresponds to step (4) in the XPlainer tree (view (6)).

Since current XPath query evaluation tools do not provide intermediate nodes, the only available debugging techniques involve either partial evaluation of subexpressions or evaluation of reversed axes. A partial evaluation cannot see beyond the current evaluation step, so it has no way of filtering out nodes that will have no effect on the final answer. For instance, a partial evaluation of the /rss/channel subexpression would return all channels below rss elements, including those that do not satisfy the [item[position() > 1]] predicate and the rest of the query. An evaluation that reverses the axes will not necessarily give us exactly the intermediate nodes either when recursive axes like descendant or following are involved. Thus, visual explanations are necessary in order to obtain the exact set of intermediate nodes that contribute to the answer. An in-depth study of visual explanations can be found in [CLR07].

The DXFileListView (view (7)) lists the documents in the extent of the active node (the yellow item in the DescribeXEditor) that are also candidates for the query shown in view (5). The documents highlighted in the DXFileListView are the explanation documents, i.e., those candidates that satisfy the complete query. The notions of candidate and explanation documents are key to the integration of XPlainer into the DescribeX framework (see Chapter 6.4). The developer can then open any candidate or explanation document in the DXFileListView with the XPlainer Editor and obtain explanations of either Q4 or different relaxations of Q4.

A relaxation of a query is obtained by selectively collapsing portions of the XPath expression to eliminate constraints. This is useful when the complete query has no answers but a relaxed version, with some constraints removed, can be satisfied. A particularly useful relaxation is the one that removes all non-structural predicates from a query Q, thus obtaining its structural subquery.

An important property of DescribeX-Eclipse is that it is not tied to any particular XPath implementation. Instead, an arbitrary XPath evaluator can be invoked through a standard interface. This engineering decision allows DescribeX-Eclipse to provide explanations for different XPath engines. Beyond differences in the capabilities of the implementations, the XPath language itself has several areas where the semantics are implementation-defined. This effectively means that only the original XPath engine can explain one of its own implementation-defined features.




With the DescribeX-Eclipse tool, debugging and exploration complement each other: Sue can decide interactively to get different descriptions of the collection by changing the SD definition, or obtain more or less strict visual explanations by relaxing a query in different ways. Thus, DescribeX-Eclipse provides Sue with a flexible, integrated environment for understanding a collection and the queries she needs to run on it.



